Saturday, April 30, 2016

anecdotally approximating the needs of digital humanities work

(The following is akin to the prologue of a future paper in the digital humanities, a motivator for my research and work.)

Biographically speaking, I am the embodiment of a digital humanities researcher, as I aspire to spend about half of my time in the digital world as a software engineer, and half of my time in the humanities, mostly working on problems of socio-economic history and its impact on religious thought. Moving in both of these worlds makes the disconnect between their tools and capabilities noticeable in a way that I find instructive for the efforts of tooling the digital humanities.

Assume I open my email inbox in the morning and find an email from my fellow software engineer Pace, who has difficulties with a piece of software that I am responsible for. For historical reasons, software engineers have termed these problems "bugs", and the process of bringing such an issue to my attention is termed a "bug report". In her bug report, Pace will describe what she did, what the expected outcome was, and what happened instead. Software engineers have a standard method for dealing with these issues: we boil the "bug" down to a small program that verifies that the expected inputs produce the expected outcomes---we call that a "unit test"---and then tweak and modify the existing software until the unit test passes, that is, until the program no longer exhibits the erroneous behavior. In doing so, good software projects draw upon their existing unit tests---long-running projects will have tens of thousands of these. Passing all unit tests ensures that eliminating one bug did not introduce any other issues: Primum non nocere.
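
To make this concrete, here is a minimal sketch of such a unit test in Python; the function normalize_date, the bug, and Pace's report are all hypothetical and serve only to illustrate the workflow.

    # A hypothetical utility that Pace's bug report might concern: single-digit
    # days were not zero-padded, so downstream sorting by date broke.
    import unittest

    def normalize_date(text):
        """Convert a date like '5 April 1844' into ISO format '1844-04-05'."""
        months = {"January": 1, "February": 2, "March": 3, "April": 4,
                  "May": 5, "June": 6, "July": 7, "August": 8,
                  "September": 9, "October": 10, "November": 11, "December": 12}
        day, month, year = text.split()
        return "%04d-%02d-%02d" % (int(year), months[month], int(day))

    class TestPacesBugReport(unittest.TestCase):
        def test_single_digit_day_is_zero_padded(self):
            # This is the "bug report turned unit test": expected input,
            # expected outcome, nothing more.
            self.assertEqual(normalize_date("5 April 1844"), "1844-04-05")

    if __name__ == "__main__":
        unittest.main()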

Assume that in the afternoon, when I check my other inbox, I find an email from fellow digital humanities researcher Paige. Paige just found a problem with one of the transcribed sources that I shared with her from the archival work that I did for my dissertation. The problem may be very small. Perhaps we can even sort out what the difficulty was and rectify it---the source may now be digitized and accessible on the web. But I have no quick and safe way to incorporate that correction into my dissertation, though it exists as a set of LaTeX files that are themselves easy to change. I have no representation of the argument that my dissertation is making. I have no way of verifying that incorporating Paige's correction is local and will not affect other claims that I make.

Of course, that situation is not very different from the one that I found myself in while writing the dissertation to begin with. It seems statistically implausible that there are no flawed steps in an argument spanning a dozen chapters, a couple hundred pages, and tens of thousands of words. No manager would hire a developer who wrote a program of that size and had checked its correctness only in their head.

Don't get me wrong: I love lemmatized POS-tagged corpora of major writers as much as the next guy does. But the main product of the humanities is arguments, and crucially counter-arguments, narratives that go against what the public already believes to be the case. Should we not devote the same care to them that software developers devote to their code?


Friday, April 29, 2016

some early thinkers in the Digital Humanities

(mostly taken from here)

John Unsworth, now the CIO and Vice-Provost at Brandeis, wrote a lot about the digital humanities, some of it quite early, including musings on the use of knowledge representation and ontologies that mention John Sowa. Unsworth drew inspiration from a Randall Davis paper from AI magazine in 1993 on what knowledge representation is (PS version).

Stephen Ramsay's papers are not accessible from the old links on the DH-website; but his website links to the famous Writing-Programming-Writing dialogue (in four flavors), and has links to the paper on sketching dynamic action (StageGraph), and the 2011 book on Reading Machines, available on Amazon.

Edward Vanhoutte was a key player in the TEI initiative, writing an introduction to the effort and the consortium in 2004.

More such as Julia Flanders, Willard McCarty, Jerome McGann and Martha Smith to be extracted later.

PS: Apparently even then, Ben(jamin) Ray had realized that the Salem Witch Trials material would make for a nice thing to put into a database, leading to the Salem Witch Trial Project at the University of Virginia in 2002. I thought we mentioned that in the 2007 proposal to the NEH ....

not drowning in digital humanities websites

There seems to be no shortage of them, to be honest.

There is even an Alliance of Digital Humanities Organizations to put them all under an umbrella. That alliance used to have an essay section that members of the MONK project recommended as a good introduction, but that can only be found in the Internet Archive/Wayback Machine. (They do have a cool list of publications, however.)

digital humanities on Twitter

Courtesy of Doron Goldfarb, here is a list of good Twitter handles for digital humanities, including a list by Martin Grandjean, who seems to be a key player in the Swiss DH scene.

https://twitter.com/ADHOrg
https://twitter.com/eadh_org
https://twitter.com/dhnow
https://twitter.com/DH_UniWien

http://www.martingrandjean.ch/digital-humanities-on-twitter/
https://twitter.com/GrandjeanMartin/lists/digital-humanities

applying TEI modeling to financial information

The MEDEA project (an acronym for Modelling semantically Enriched Digital Edition of Accounts) was pointed out to me by Paige Morgan (University of Miami, Florida) as an effort that she is involved in. It would be fun to take the accounts from the steam-boat "Maid of Iowa" and transcribe them into their TEI format, once that encoding format has been fully defined.

Paige also pointed me to the Guelph workshop on Digital Humanities where she is teaching for four days from May 9-12, 2016. Here's wishing her all the best.

digital humanities tools

I looked at WordHoard today, a tool spearheaded by Martin Mueller of NWU, one of the grandmasters in the TEI text-encoding initiative (he was also behind the MONK project).

I was struck by some of the cool aspects of WordHoard, such as the stemming and POS infrastructure and the Bean scripting window, as well as some of the weirdness (MySQL-backed Hibernate objects that require a full rebuild for every XML edit?).

Some of the linguistic services available from MONK are incorporated in the MorphAdorner that is still available on BitBucket (with cool online examples hosted at NWU). There is also a collaboration with the Lucene-based BlackLab index.

The TEI community seems to be big fans of the Oxygen XML editor, but for US$240 I can handle a lot of pain instead, to be quite honest about it.

I also noticed that the next TEI conference is in Vienna, in September, and that they have a call for papers out now. Maybe I can think of something cool to write on.

Wednesday, April 20, 2016

On estimation of confounders

I came to this paper on controlling for statistical confounders via Scott Alexander's Slate Star Codex website, which covers statistical topics.

Westfall and Yarkoni remind us that we use proxies to estimate the true latent constructs, e.g. a "specific survey item asking about respondents’ income bracket" to estimate "socioeconomic status". Thus, statistical arguments cannot easily control for the potentially confounding latent construct, but only for a measure of that construct, which comes with its own measurement error.

At the root of the problem lies an insight that runs counter to conventional wisdom: 
The relationship between n and Type 1 error may be less obvious: all else equal, as sample size increases, error rates also increase.
The problem is exacerbated by the way reliability interacts with error rates, giving a non-monotonic relationship.
In the middle [of the reliability range, RCK], however, there exists a territory where effects are large enough to afford detection, but reliability is too low to prevent misattribution, leading to particularly high Type 1 error rates.
Furthermore, this problem is independent of the statistical analysis approach used (frequentist vs Bayesian, parameter estimation) as long as reliability is not explicitly accounted for.

Indeed, somehow getting a grip on the reliability is the core of the problem. But that is easier said than done. Westfall and Yarkoni point out that
econometric studies attempting to control for SES [i.e. socioeconomic status, cf. above, RCK] hardly ever estimate or report the reliability of the actual survey item(s) used to operationalize the SES construct ....
This is why Westfall and Yarkoni recommend structural equation modeling (SEM) in order to avoid the trap altogether. Lacking any estimates of the reliability, researchers can at least plot their results across a range of estimates to see how sensitive their findings are to reliability.
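
For intuition, here is a small simulation in the spirit of their argument (my own sketch, not the authors' code); the sample size, the effect sizes, and the reliability of 0.6 are illustrative assumptions.

    # Simulate a confounder (e.g. SES) that drives both predictor and outcome,
    # but is only observed through an unreliable proxy. Controlling for the
    # proxy leaves residual confounding, which a large sample turns into a
    # "significant" effect of x on y: a Type 1 error.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    reliability = 0.6     # share of proxy variance due to the true construct

    ses = rng.normal(size=n)                              # true latent confounder
    noise_sd = np.sqrt((1 - reliability) / reliability)
    ses_proxy = ses + rng.normal(scale=noise_sd, size=n)  # noisy survey measure
    x = 0.5 * ses + rng.normal(size=n)                    # predictor, driven by SES
    y = 0.5 * ses + rng.normal(size=n)                    # outcome, driven only by SES

    # OLS of y on x, "controlling" for the proxy instead of the true construct
    X = np.column_stack([np.ones(n), x, ses_proxy])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("estimated effect of x after controlling for the proxy:", round(beta[1], 3))
    # With a perfectly reliable measure this coefficient would hover around zero;
    # at reliability 0.6 it comes out clearly positive.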

Bibliographic Record:
Westfall J, Yarkoni T (2016) Statistically Controlling for Confounding Constructs Is Harder than You Think. PLoS ONE 11(3): e0152719. doi:10.1371/journal.pone.0152719

Sunday, April 17, 2016

Nelson Goodman on Ampliative Inference

Ampliative inference is central to many historiographical arguments; it includes induction by enumeration, analogical reasoning, inference to the best explanation, and other forms that expand on what is available in the premises.

However, it is precisely in these contexts that projectibility can be lost. In my current understanding this has to do with discontinuities in the underlying space, as shown by the bleen and grue examples that Nelson Goodman introduced in the 1940s and revisited in the "New Riddle of Induction" chapter of Fact, Fiction and Forecast. Thus, while
  1. All observed emeralds are green.
  2. Thus, all emeralds are green.
is valid, the form
  1. All observed emeralds are grue.
  2. Thus, all emeralds are grue.
fails: the fact that the definition of "grue" contains the notion of "observed before the year X" means that the property "grue" is insufficiently general for projectibility.
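
As a toy illustration (my own sketch, not Goodman's formulation), the following snippet shows that both predicates fit all observations made before the cutoff year, yet project incompatible predictions afterwards; the cutoff year and the observation list are arbitrary.

    CUTOFF_YEAR = 2000   # the "year X" in the definition of grue

    def is_green(color, year_observed):
        return color == "green"

    def is_grue(color, year_observed):
        # grue: green if observed before the cutoff year, blue otherwise
        return (color == "green") if year_observed < CUTOFF_YEAR else (color == "blue")

    observations = [("green", 1950), ("green", 1970), ("green", 1990)]

    # Enumerative induction "confirms" both hypotheses equally well ...
    print(all(is_green(c, y) for c, y in observations))     # True
    print(all(is_grue(c, y) for c, y in observations))      # True

    # ... but they disagree about an emerald examined after the cutoff:
    print(is_green("green", 2020), is_grue("green", 2020))  # True False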

Chris Swoyer, who wrote the SEP article used here, argues that the grue problem is "just an instance of the ubiquitous underdetermination of hypotheses by finite bodies of data". And such underdetermination is a problem for inductive inference, unless the induction is controlled in some fashion, i.e. some hypotheses are rejected from the inductive step. Restriction to projectible relations is one way to tackle the problem.

In his 3rd Lecture, Goodman looks at the way that justification and inference are related, and states a co-dependency that is virtuously circular:
A rule is amended if it yields an inference we are unwilling to accept; an inference is rejected if it violates a rule we are unwilling to amend. (p.67)
Thus:
The process of justification is the delicate one of making mutual adjustments between rules and accepted inferences; and in the agreement achieved lies the only justification needed for either. (p.67)
Similarly for inductive inference.
An inductive inference, too, is justified by conformity to general rules, and a general rule by conformity to accepted inductive inferences. Predictions are justified if they conform to valid canons of induction; and the canons are valid if they accurately codify accepted inductive practice. (p.67)
Applying this to inductive work, which is neglected as compared to deductive inference, is what Goodman calls the Constructive Task of Confirmation Theory (p.68). However, just as with defining any term, we tweak the usage and the definition to make sure that the definition picks out only the instances of usage and no others (p.69).

Goodman's discussion of the black things and the ravens starting on pp.70ff depends on Carl Gustav Hempel's Studies in the Logic of Confirmation, from: Mind, v.54 n.213 (January 1945), pp.1-26, where Hempel (p.9ff) pulls apart Nicod's criterion for confirmation or invalidation, as Nicod termed it. The basic point is that the structure of a rule in its logical form as a disjunction no longer separates the assumptions of the hypothesis from its predictions. This has the effect that logically equivalent hypotheses (R => B and !B => !R) can differ in whether a given observation supports them or leaves them neutral (e.g. the observation !R & !B supports the second but not the first hypothesis). Clearly, independence of formulation is an important desideratum (Hempel 1945, p.12).

[Note: I could not find an easy simplification rule that allows rewriting S1 in the multi-variable case (Hempel 1945, p.13 Fn) into S2, i.e. ~R(x,y) => R(x,y) & ~R(y,x) into R(x,y), but the truth tables show that this is so.]
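
Since the note appeals to truth tables, here is a quick brute-force check (my own sketch) that S1 and S2 agree on every assignment of truth values to R(x,y) and R(y,x).

    # S1: ~R(x,y) => (R(x,y) & ~R(y,x));  S2: R(x,y)
    from itertools import product

    def implies(p, q):
        return (not p) or q

    for r_xy, r_yx in product([False, True], repeat=2):
        s1 = implies(not r_xy, r_xy and not r_yx)
        s2 = r_xy
        assert s1 == s2, (r_xy, r_yx)
    print("S1 and S2 agree on all four assignments")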
Taking a leaf from Carl Gustav Hempel, Goodman points out that in the hypothesis step only the variables for the objects can be substituted, not the relations (p.71), or put differently:
... the hypothesis says of all things what the evidence statement says of one thing .... (p.71)
The central idea for an improved definition is that, within certain limitations, what is asserted to be true for the narrow universe of the evidence statements is confirmed for the whole universe of discourse. (p.73) 
 (to be continued)

Berger on the Sacred Canopy -- I.4 Alienation

(<< Chapter 3 -- Theodicy || Chapter 5 -- Secularization >>)

Religion and Alienation

Berger now re-iterates the difference between a physical event and an event with social meaning, taking the example of an accident versus an execution (pp.81f).
The individual may "co-operate" in the execution in a way in which he never can in the accident---namely, by apprehending it in terms of those objective meanings he shares, .... Thus the victim of an execution can die "correctly" in a way that would be more difficult for the victim of an accident. (p.82) 
Giving an example involving polygamy and monogamy (pp.82f), Berger reiterates that the fantasies are lesser realities because the objective social reality is internalized; even the individual feels real only in his internalized role.
What concerns us here is simply the important fact that the social world retains its character of objective reality as it is internalized. It is there in consciousness too. (p.83)
This consciousness will decompose, analytically speaking, into a socialized and a non-socialized component (p.83).
... the duplication of consciousness [t.t., RCK] brought about by the internalization of the social world has the consequence of setting aside, congealing or estranging one part of consciousness as against the rest. (p.83)
In other words, the duplication of consciousness results in an internal confrontation between socialized and non-socialized components of self, reiterating within consciousness itself the external confrontation between society and the individual. (p.84)
Just as some of the things that man does become part of the objective reality of society and "escape" him (p.85), so even part of his own self escapes, the part shaped by socialization.
As a result, it becomes a possibility not only that the social world seems strange to the individual, but that he becomes strange to himself in certain aspects of his socialized self. (p.85) 
There are times when the externalized self can be reappropriated; but wherever that fails, and the "socialized self confronts the individual as inexorable facticities analogous to the facticities of nature", we can call this process "alienation" (p.85).
... alienation is the process whereby the dialectical relationship between the individual and his world is lost to consciousness. The individual "forgets" that this world was and continues to be co-produced by him. (p.85)
The effect is that social world and natural world merge, because the constructed character of the social world is lost in the alienated, undialectical and false consciousness (p.85). This leads to an inversion of the relationship between man and world and to a loss of meaning.
The actor becomes only that which is acted upon. The producer is apprehended only as product. In this loss of societal dialectic, activity itself comes to appear as something other---namely, as process, destiny, or fate. (p.86)
Berger emphasizes that such alienation is a state of consciousness, a typical stage in onto- and phylogenetic development (p.86), but is different from anomy (p.87), because, if anything, the world as opus alienum is "seemingly everlasting" (p.87).

Berger now observes that the success of religion as "the most effective bulwark against anomy throughout human history" lies in its "alienating propensity" (p.87) due to the numinous presenting itself as the totaliter aliter (p.87).
... the ultimate epistemological status of these reports [of an other reality somehow impinging or bordering upon the empirical world, RCK] of religious men will have to be rigorously bracketed. (p.88)
... whatever else the constellations of the sacred may be "ultimately", empirically they are products of human activity and human signification---that is, they are human projections. (p.89)
Thus, because these produced projections of meaning are experienced as external to the individuals, they are alienated projections (p.89).
The fundamental "recipe" of religious legitimation is the transformation of human products into supra- or non-human facticities. (p.89) 
The humanly made world is explained in terms that deny its human production. (p.89)
The human nomos becomes a divine cosmos, or at any rate a reality that derives its meanings from beyond the human sphere. (p.89)
Berger believes that "simply equating religion with alienation"  is too much, as that would "entail an epistemological assumption inadmissible within a scientific frame of reference" (p.89), but is prepared to contend that
the historical part of religion in the world-building and world-maintaining enterprises of man is in large measure due to the alienating power inherent in religion. (p.89)
Again Berger insists that "the presence in reality of beings and forces that are alien to the human world" is an assertion that "in all its forms, is not amenable to empirical inquiry" (p.89).  [[RCK: My philosopher friends would assert that the onus is on the side of religion to provide the empirical proof, not on the side of modern agnostic science. In general, we should eschew entities that are not amenable to empirical inquiry in our theories, even if some entities are very a]]
What is amenable, though, is the very strong tendency of religion to alienate the human world in the process. (p.89)
Thus, religion has contributed to the mystification of the world by allowing man to live with a false consciousness (p.90).
The socio-cultural world, which is an edifice of human meanings, is overlaid with mysteries posited as non-human in their origin. All human productions are, at least potentially, comprehensible in human terms. (p.90)
Berger then turns to the example of marriage and kinship, noting that all societies have worked out "more or less restrictive `programs` for the sexual activity of its members" (p.90). One way to accomplish faithful adherence
is to mystify the institution in religious terms. (p.90)
(to be continued)

Bibliographic Record

Peter L. Berger, The Sacred Canopy: Elements of a Sociological Theory of Religion, Garden City, NY (Double Day) 1967.

William Heth Whitsitt's Sidney Rigdon MS, (Part 1)

William Heth Whitsitt's Sidney Rigdon MS (Part 1) is a long paper arguing that Rigdon must be the author of the Book of Mormon, based on the alignment with Disciple theology, written by a Southern Baptist theologian.

I ended up not using this in my book.

the possibility of digitally supporting argumentation in the humanities

As noted in my previous post, while the decision process regarding the quality of arguments in the digital humanities is complicated and usually involves mainly sins of omission, the basic model of an argument being open to confirmation makes for a sensible launching point for bringing DH support to bear.

It is important to understand that at the level of generality under discussion here, there is no bias in principle toward formal models of argumentation. By this I mean that there is no bias toward argumentation whose rules of inference have already been formalized (such as first-order predicate calculus or decision logic or similar). However, it seems to me that the interest in the ability of others to follow our arguments implies that the rules of inference I use in my argument could at least in principle be formalized. If such a "mechanical" or "symbolic" notion of inference is admitted, then Turing has good news for us: such a process of validating an argument is a Turing-computable function, and we can bring effective digital support to bear on the problem.

On the one hand it seems clear that, within the schools of the various disciplines of the humanities, rules for what are valid and proper forms of argumentation (and what are not) have developed and are taught successfully to the new generation. (This is the direction that Thomas Kuhn's arguments about the paradigms of scientific practice are hinting toward.) On the other hand, most humanities scholars would be hard-pressed to formalize their underlying rules of inference. They may even have difficulties pressing the forms of argumentation of their school of thought correctly into another formalized system of inference---such as first-order predicate logic or decision logic.

The situation appears perhaps even more difficult when faced with heuristics, by which I mean argument constructs that are rule-like but mutually exclusive and known to be valid only in some situations. Perhaps best-known in the humanities are the rules for cross-document textual interpretation when reconstructing an Urtext, such as lectio difficilior or lectio brevior. These heuristics come into conflict when the shorter reading eliminates the difficulties.

But even if we are admonished not to apply these heuristics mechanically, the fact that weighing them against each other can be taught suggests that they can be explained and, in the limit, applied mechanically by those wishing to understand the process of our weighing, that is, our argument.

Sunday, April 10, 2016

digital support for argumentation in the humanities

Perhaps it is especially obvious to someone who has just published his first book that monographs are large complex theoretical undertakings that could benefit from more digital support than word processing, map drawing, or Google and Amazon Kindle books (as searchable sources) provide---however helpful all of these things are!

At the same time, these analogous tools give an appreciation for the limitations of the support one should expect realistically: for example, it took about two weeks of low-level LaTeX tweaking to get my monograph just right, with judicious page stretching and negative-space fiddling and, in the end, we were stuck splicing a page-break into a generated file (the output from the indexer) in order to get the look "just right" (TM by Goldilocks)---a computer-science no-no if there ever was one!

The basic idea behind argumentation in the humanities is simple enough: I would want other people, given the data that I used, and applying to these the analysis that I applied, to come to the same conclusion that I arrived at. In the limit, that process should be effectively mechanical.
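
A minimal sketch of this "mechanical" reading, assuming one is willing to represent an argument as data plus an analysis step plus a claimed conclusion (all names and the toy ledger below are invented for illustration):

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class Argument:
        data: Sequence
        analysis: Callable[[Sequence], object]
        claimed_conclusion: object

        def check(self) -> bool:
            # Anyone can re-run the analysis on the data and compare the result
            # with the conclusion that the argument claims.
            return self.analysis(self.data) == self.claimed_conclusion

    # Toy example: entries from an invented account book, and the claim that
    # expenditures exceeded receipts in 1844.
    ledger = [("1844", "receipt", 120), ("1844", "expense", 150), ("1845", "receipt", 80)]

    def expenses_exceed_receipts_in_1844(entries):
        receipts = sum(v for y, kind, v in entries if y == "1844" and kind == "receipt")
        expenses = sum(v for y, kind, v in entries if y == "1844" and kind == "expense")
        return expenses > receipts

    argument = Argument(ledger, expenses_exceed_receipts_in_1844, True)
    print(argument.check())   # True: the conclusion is reproducible from data + analysis

Challenges of the kind listed below then amount to swapping out the data or the analysis and seeing whether the conclusion survives.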

The practice of actual scholarly argumentation is of course vastly more complicated; people often challenge arguments by

  • challenging the data selection, meaning that there is other data that should have been used
  • challenging the data use, meaning that other features of the data set should have been used
  • challenging the models (scripts, in the language of Schank and Abelson) that are used to interpret people's actions
  • etc.
The implied criticism is always that these differences in strategies would have led to differences in the argument outcomes (everything else would be supporting evidence). This means that an argument that can be followed and agreed to under its premises is only a small part of the scientific process in the humanities. 

While this is true, it also stands to reason that the infrastructure developed for digitally supporting the construction and maintenance of humanities arguments and narratives is a good launching pad for digitally supporting the comparison of different arguments and narratives in the humanities.
So, while the effort expended in the digital support of humanities argumentation will not solve "the problem", it is an interesting incremental step toward supporting ever more of the scientific work of the humanities digitally.