Project Update: Building a Bibliographic Dataset [Davidson Project]

A bibliographic-coupling network built from a subset of the Davidson bibliographic dataset. Each node is a scientific publication, and an edge connects two publications that share multiple bibliographic references. The colored nodes indicate the distribution of a “topic” from an LDA-generated topic model.

By Divyash Chhetri; edited by Sam Hauserman.

One of our exploratory projects in computational HPS is an attempt to leverage patterns from large sets of bibliographic and textual data to provide new perspectives on the content and contexts of the investigative pathway of embryologist Eric H. Davidson. Our aspiration is to use a variety of computational approaches to investigate Davidson’s research, drawing on data about coauthorship, citations, and patterns of language in scientific texts. As we go along, we are learning just as much about the tools and methods at our disposal as we are about Davidson’s work.

Our first step was to collect bibliographic data for as many of Davidson’s papers as we could find. In total, we found data for around 540 publications. We then extracted a list of all of Davidson’s coauthors, totaling 515 unique names. Sam has had the arduous task of “disambiguating” each of these names: identifying each individual, collecting basic biographical information about them, and creating an entry for each person in the Conceptpower authority file.

Meanwhile, Divyash has been collecting bibliographic records for each of Davidson’s collaborators. The final dataset comprises 16,528 unique bibliographic entries, along with citation data. Of those, 9,838 (mostly post-1990) have abstracts. These data will be used for a variety of citation-based analyses, and the abstracts will be used for topic modeling.
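As an illustration of one such citation-based analysis, here is a minimal sketch of bibliographic coupling, the idea behind the network pictured at the top of this update. It is not the code we actually used. The record IDs and reference IDs are hypothetical toy values, and the threshold of two shared references is an assumed reading of “multiple.”

```python
from itertools import combinations


def coupling_edges(references, min_shared=2):
    """Return {(paper_a, paper_b): n_shared} for every pair of papers
    whose reference sets overlap in at least `min_shared` items."""
    edges = {}
    # Sorting the IDs makes each pair appear in a stable (a, b) order.
    for a, b in combinations(sorted(references), 2):
        shared = len(references[a] & references[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


# Hypothetical toy records: paper ID -> set of cited-reference IDs.
toy_references = {
    "paper_A": {"ref_1", "ref_2", "ref_3"},
    "paper_B": {"ref_2", "ref_3", "ref_4"},
    "paper_C": {"ref_3", "ref_5"},
}

# Assumed threshold: "multiple" shared references is read as two or more.
toy_edges = coupling_edges(toy_references, min_shared=2)
print(toy_edges)
```

Comparing every pair is the simplest approach, but for 16,528 entries it means well over a hundred million comparisons. An index from each cited reference to the papers that cite it would likely be a better fit at that scale.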

In this update, we describe the methods used to collect these data, as well as some of their potential weaknesses. These considerations will be crucial as we attempt to interpret the patterns that we will generate from these data in the coming weeks.

Obtaining the Data

A Clean Data Set

The majority of the data were obtained through a simple search of the individual’s last name, followed by their first and middle initials, in the Web of Science database. When only the last name and initials were used, these searches typically yielded a manageable number of results (fewer than 200). Adding further search restrictions, such as the selection of specific research domains, refined the results further. In most cases, the limiting fields used were the research areas of developmental biology, cell biology, genetics and heredity, and biochemistry and molecular biology. In a few instances, the fields of microbiology, applied microbiology, and biotechnology were also used.
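The search described above was carried out by hand in the web interface. For readers who prefer to see it written down, the following is a rough sketch of how the same name-plus-research-area restriction might be expressed as a query string. The function name `build_wos_query` is hypothetical. The `AU=` (author) and `SU=` (research area) field tags, and the exact spelling of the research-area labels, are assumptions about the Web of Science advanced-search syntax that should be checked against its documentation before use.

```python
def build_wos_query(last_name, initials, research_areas=()):
    """Compose an advanced-search style query string from an author name
    and an optional list of research areas.

    Assumption: AU= and SU= are the author and research-area field tags.
    """
    query = f"AU=({last_name} {initials})"
    if research_areas:
        # Quote each area so multi-word labels are kept together.
        areas = " OR ".join(f'"{area}"' for area in research_areas)
        query += f" AND SU=({areas})"
    return query


# Assumed labels for the limiting fields named in the text.
LIFE_SCIENCE_AREAS = (
    "Developmental Biology",
    "Cell Biology",
    "Genetics & Heredity",
    "Biochemistry & Molecular Biology",
)

example_query = build_wos_query("Davidson", "EH", LIFE_SCIENCE_AREAS)
print(example_query)
```

Writing the query out this way also makes the limiting fields explicit, which matters later when asking whether they covered all of a coauthor’s work.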

A Not-So-Clean Data Set

In the cases where the search yielded upwards of 500 records, a different approach was used rather than the simple ‘search and refine’ method described above. For these individuals, it usually took a little digging to find out which institutions they had been affiliated with. Then the ‘Author Search’ functionality was used to obtain the records: the full last name and the initials of the first and middle names were entered, with the ‘exact matches only’ box checked. Following that step, their research domains were once again used to refine the results. For most of the searches, the fields used were the same as above (developmental biology, cell biology, etc.), all of which fall under the larger domain of life sciences and biomedicine. This wasn’t always the case, however, since some people were involved in the physical sciences, typically chemistry and physics. Author Search then asks the user to pick out the institutions with which the individual was affiliated (that’s where the preliminary digging came in handy). This approach surfaced a more accurate set of records that would otherwise have been buried in the simple search’s vast number of results. Records for roughly three in ten of the 515 coauthors were found in this manner.

The Methodological Obstacles


Despite the use of Author Search, records for some individuals simply could not be found because their full first names were unknown. This presents a problem when a name as generic as ‘H Sugiyama’ yields more than 5,000 results, which also makes it difficult to pinpoint the institutions with which the person is affiliated. Roughly 10 to 15 people fall into this category. Furthermore, some coauthors may have changed their names over the course of their scientific careers, or ended up with more than one spelling as a product of translation.
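One partial aid for the spelling problem is to reduce each name to a crude “last name plus initials” key and flag names that collapse to the same key for manual review. The sketch below is an illustration under stated assumptions, not part of our actual workflow. The helper names are hypothetical, the ‘Müller’ variants are invented examples, and the rule that treats a short all-capitals token such as “EH” as a run of initials is a heuristic.

```python
import unicodedata
from collections import defaultdict


def to_ascii(text):
    """Strip diacritics, e.g. 'Müller' -> 'Muller'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def initials_of(given_names):
    """Collect lowercase initials from given names or initial strings."""
    letters = []
    for part in to_ascii(given_names).replace(".", " ").replace("-", " ").split():
        if part.isupper() and len(part) <= 3:
            letters.extend(part)     # heuristic: "EH" is two initials
        else:
            letters.append(part[0])  # "Eric" contributes "E"
    return "".join(letters).lower()


def name_key(last_name, given_names):
    """Build a crude matching key such as 'muller hp'."""
    last = to_ascii(last_name).lower().replace("-", " ").strip()
    return f"{last} {initials_of(given_names)}".strip()


def group_variants(names):
    """Return only the keys shared by more than one recorded spelling."""
    groups = defaultdict(list)
    for last, given in names:
        groups[name_key(last, given)].append((last, given))
    return {key: found for key, found in groups.items() if len(found) > 1}


# Invented examples, plus the generic name mentioned in the text.
toy_names = [
    ("Müller", "Hans Peter"),
    ("Muller", "H.P."),
    ("Muller", "HP"),
    ("Sugiyama", "H"),
]
candidates = group_variants(toy_names)
print(candidates)
```

A key like this only groups spellings that differ in accents, punctuation, or the use of initials. It does not catch different transliterations of the same name or name changes, and it does nothing for names that are too generic to begin with, so any matches it produces are candidates to check by hand rather than confirmed identities.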

Weaknesses of the Data Set

When dealing with such a large volume of information, some errors in the dataset are inevitable. One issue that we consider is whether the limiting fields used actually encompassed all of a person’s work. Using the domains of developmental biology, cell biology, etc. made sense because these disciplines are at least somewhat related, and would likely capture records in which the coauthor was involved. But what if the authors went beyond these disciplinary boundaries and published works in other fields? With the advantages and ubiquity of multidisciplinarity, it is common for a marine biologist to also work with bioinformatics, or for a genomics researcher to later become anything from an ophthalmologist to a biotechnology developer.