This week is a bit of a mash-up of various things. Tuesday we introduced the idea of a web service, and specifically talked about geocoding web services. To dive a bit deeper into web services, you may find it useful to look at these slides from a recent course on data collection for the humanities at Cambridge. A tutorial on generating geocoded coauthorship and institutional networks can be found here.
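To make the idea of a geocoding web service a bit more concrete, here is a minimal sketch of how a request to one might be built. It uses the OpenStreetMap Nominatim service's `/search` endpoint (which really does accept `q` and `format` parameters), but the function name and the example place are just illustrative; treat this as a sketch, not as the specific service discussed in class.

```python
from urllib.parse import urlencode

def build_geocode_url(place_name):
    """Construct a request URL for the OpenStreetMap Nominatim geocoder.

    Nominatim is a free geocoding web service; its /search endpoint
    accepts a free-text query (q) and a response format (format).
    """
    base = "https://nominatim.openstreetmap.org/search"
    params = {"q": place_name, "format": "json", "limit": 1}
    return base + "?" + urlencode(params)

# The returned URL can be fetched with any HTTP client; the JSON
# response includes latitude/longitude fields for the best match.
print(build_geocode_url("Cambridge, UK"))
```

The point is simply that a web service is "just" an HTTP request with some parameters: you send a place name, and structured data (coordinates) comes back.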
Today we're going to dive a bit deeper into producing structured datasets from texts themselves. When we used Named Entity Recognition, we were able to locate instances of particular kinds of entities -- e.g. people, places, and institutions -- in texts. The advantage of approaches like NER is that they require little input from the operator, and can therefore be used on very large collections of texts without "supervision." One downside of this approach, however, is that although we know we have found instances of particular kinds of things, we do not know anything about those instances: NER can find names of people, but it can't tell us who those people are. Another downside is that we learn nothing about the relationships between those entities: we may find two personal names in the same sentence or paragraph, but it is difficult to know precisely how they are related. Unsupervised techniques also leave no room for differences of interpretation between readers, among other limitations.
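The limitations above are easy to see in a toy example. The sketch below is a deliberately crude stand-in for NER, using a hand-made gazetteer rather than a trained statistical model (real systems like spaCy or Stanford NER learn such patterns from data); all the names and entity types here are purely illustrative.

```python
import re

# A toy stand-in for NER: a hand-made gazetteer of entity types.
# (Real NER systems learn these patterns statistically; the names
# here are purely illustrative.)
GAZETTEER = {
    "Darwin": "PERSON",
    "Hooker": "PERSON",
    "Cambridge": "PLACE",
    "Royal Society": "INSTITUTION",
}

def tag_entities(text):
    """Return (surface form, entity type) pairs found in the text, in order."""
    pattern = "|".join(re.escape(name) for name in GAZETTEER)
    return [(m.group(0), GAZETTEER[m.group(0)])
            for m in re.finditer(pattern, text)]

sentence = "Darwin wrote to Hooker from Cambridge."
print(tag_entities(sentence))
```

Notice what the output does and does not contain: we learn that two PERSON entities co-occur in the sentence, but nothing about who "Darwin" and "Hooker" are, and nothing about the relationship between them -- exactly the gaps described above.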
To deal with some of those issues, we've been advocating a 'meso-level' approach that brings human readers back into the mix. This evening, we'll introduce some of the main concepts and components of this approach. You may find it helpful to read this paper, which we presented at a conference last year.