Project Update: Five Hundred and Thirteen Disambiguations [Davidson Project]

By Sam Hauserman

We hit an important milestone this week in the Davidson Project: the completion of the Eric Davidson Social Network database.

In last week’s project update, we explained how we generated our bibliographic data set, which will provide some of the raw material that we use to analyze relationships within this community of researchers. My contribution to the dataset was to disambiguate the names and identities of all 513 of Davidson’s coauthors.

Because the authors of scientific papers are typically represented only by an (often non-unique) pairing of initials and surname, we need to establish which individual is the actual Davidson coauthor before our bibliographic information becomes useful. Once we have established the identity of each individual scientist, we can add them to the Conceptpower authority file so that we have an unambiguous way to refer to them in our data. In this week’s project update, I’ll go into a little detail about what the disambiguation process entails.
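To make the ambiguity concrete, here is a minimal Python sketch of how distinct people collapse into the same byline string. It is only an illustration, not the tooling used on the project: the function names `initials_key` and `group_by_key` are made up for this example, and "Sarah Claire Benson" is a hypothetical person invented to show a collision.

```python
from collections import defaultdict


def initials_key(full_name: str) -> str:
    """Collapse a full name to 'Surname INITIALS', as in a typical byline.

    Assumes the surname is the last whitespace-separated token, which is
    a simplification that will not hold for every naming convention.
    """
    parts = full_name.replace(".", " ").split()
    surname = parts[-1]
    initials = "".join(p[0].upper() for p in parts[:-1])
    return f"{surname} {initials}"


def group_by_key(names):
    """Group full names by the byline string they would share."""
    groups = defaultdict(list)
    for name in names:
        groups[initials_key(name)].append(name)
    return dict(groups)


# "Sarah Claire Benson" is hypothetical, included only to force a collision.
people = ["Stephen C. Benson", "Sarah Claire Benson", "Eric Davidson"]
groups = group_by_key(people)

# Keep only the byline strings that more than one person maps to.
ambiguous = {key: names for key, names in groups.items() if len(names) > 1}
print(ambiguous)
```

Any byline string that lands in `ambiguous` is one that cannot be resolved from the bibliographic record alone and needs the kind of manual research described below.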

GOOGLE: Putting A Name To The Name

If I was lucky, the coauthor already had a full first name, or they were an easily searchable established professional. More often than not, identifying the coauthor involved a multi-tab triangulation of several different sources before I became confident that I had found the correct person. Past papers, lab websites, LinkedIn profiles, and even some keyword search guesswork helped me to acquire the bit of basic biographical information that set Stephen C. Benson, the cell biologist, apart from all the other SC Bensons (and South Carolina car dealerships).

One useful method for tracking down a name and biographical information was to search a scientific publication database. While this worked for many of the coauthors, it wasn’t foolproof. For example, while PubMed often listed institutional affiliations, it rarely showed the author’s full name. ScienceDirect, on the other hand, had a narrower selection of papers, but usually provided full names as well as institutional affiliations. Sometimes the publication databases wouldn’t have any new information at all, and I had to identify the coauthor another way.

Different types of challenges emerged as I completed the database. For example, a handful of female coauthors married and changed their names between collaborations with Davidson, contributing to ambiguity. For many Chinese, Japanese, and Korean coauthors, institutional and biographical information was not available in English, limiting my ability to make any sort of identification. Some coauthor names were just too common. The coauthors who remain in the database as initialed surnames, associated only with their Davidson publication, are those for whom no disambiguating information could be found.

VIAF: Linking Up Authority Files

Where possible, an already existing authority file (sort of like a digital catalogue tag for bibliographic material) was linked to the authority file we created for the coauthor. This was done using the Virtual International Authority File (VIAF), a composite authority file library. For most of the coauthors, authority files had not been created, and when there were potential matches, sometimes the descriptions would be too ambiguous (or nonexistent) to confidently link one authority file to the coauthor. However, for 130 of the 513 coauthors, an established authority file could be linked to ours.

CONCEPTPOWER: Creating The Entry

Finally, with (ideally) a full name, a VIAF authority file, and some fragment of biographical information in hand, I could create a “concept” entry in Conceptpower, our online authority file. This entry included information such as part of speech (noun), the concept list to which the entry belonged (Persons), and the concept type from the CIDOC Conceptual Reference Model (E21 Person).

Each of these entries is assigned a Uniform Resource Identifier (URI). As the Davidson Project proceeds through various analytic phases, we can use these URIs in place of simple name-strings so that we can refer to individual researchers unambiguously.
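As a rough illustration of why URIs help, the Python sketch below counts papers per author twice: once by byline string and once by URI. Everything in it is made up for the example. The `example.org` URIs, the paper labels, and the second "Benson SC" are placeholders, not real Conceptpower identifiers or records from our dataset.

```python
from collections import Counter

# Hypothetical URIs standing in for authority file entries.
BENSON_CELL_BIOLOGIST = "http://example.org/concepts/person-0001"
OTHER_SC_BENSON = "http://example.org/concepts/person-0002"

# Each row: (paper label, byline string as printed, URI assigned after
# disambiguation). All rows are invented for illustration.
authorships = [
    ("paper-A", "Benson SC", BENSON_CELL_BIOLOGIST),
    ("paper-B", "Benson SC", OTHER_SC_BENSON),
    ("paper-C", "Benson SC", BENSON_CELL_BIOLOGIST),
]

# Counting by name-string merges two different people into one author.
papers_by_string = Counter(byline for _, byline, _ in authorships)

# Counting by URI keeps the two individuals separate.
papers_by_uri = Counter(uri for _, _, uri in authorships)

print(papers_by_string)
print(papers_by_uri)
```

Counting by string credits a single "Benson SC" with all three papers, while counting by URI splits them between two people, which is the distinction that matters for any analysis of who actually worked with whom.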