History of the Max Planck Society — Department Baldwin - Introgression in Co-authorship Networks

By Erick Peirson

An important component of analyzing the causes and dynamics of conceptual change in science is understanding the behavior and influence of individual scientists, in the context of their collaborations and discursive activity. Fleck's concept of Denkkollektiv drew attention to the ways in which patterns of collaboration give rise to specialized Denkstil -- patterns of thought, language, and practice that constitute the lens through which scientists see the natural world and ask questions about it. Consistent with our everyday experience in social situations, Social Network Analysis has shown how power and influence are distributed unevenly among individual actors in collectives, shaping the flow of ideas and information in those social networks. Graph theory gives us a rich collection of concepts and metrics to express such influence quantitatively, based on the structural properties of networks. As historians we are interested not only in the structure of particular social networks, but also in how those networks evolve. With respect to analyzing the behavior and influence of individual actors, this prompts us to ask how different scientists enter existing collaborative networks, and how their structural positions within those networks change over time.

One way to pursue this question is to turn to one of the most readily available sources of data for estimating collaborative networks in science: bibliographic records from the research literature. We are aware of the shortcomings of this particular data source, but as a test case we looked at the collaborative behavior of chemical ecologist Ian T. Baldwin, founding director of the Max Planck Institute for Chemical Ecology. Baldwin started publishing in 1981, and around one-sixth of his publications are in the journal Plant Physiology, where he first published in 2001. We downloaded records from the Web of Science (WoS) for all of the papers published in Plant Physiology from 1999 through 2013. The dataset begins a few years before Baldwin's first paper in that journal, so that we could scrutinize the point at which he became active. We loaded the WoS data into Tethne (a Python module developed by the ASU Digital Innovation Group), and generated a coauthorship network.
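The step from bibliographic records to a coauthorship network can be sketched without any particular library: every pair of authors on the same paper gets an edge. The records below are invented toy data rather than real WoS records, and `coauthor_edges` is a hypothetical helper for illustration, not part of Tethne's API.

```python
from collections import Counter
from itertools import combinations

# Toy records (invented for illustration; not real Web of Science data).
records = [
    {"year": 1999, "authors": ["A", "B", "C"]},
    {"year": 2000, "authors": ["A", "B"]},
    {"year": 2001, "authors": ["D", "E"]},
]

def coauthor_edges(records):
    """Count how many papers each unordered pair of authors shares."""
    edges = Counter()
    for rec in records:
        # sorted(set(...)) drops duplicate names and fixes the pair order,
        # so ("A", "B") and ("B", "A") are counted as the same edge.
        for pair in combinations(sorted(set(rec["authors"])), 2):
            edges[pair] += 1
    return edges

edges = coauthor_edges(records)
```

The resulting pair counts can serve as edge weights when the network is handed to a graph library.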

Using the whole bibliographic dataset yielded a coauthorship network with over 19,500 nodes, which was a bit unwieldy. So we limited the network to the first four years of the dataset (1999 - 2002), and visualized the resulting coauthorship network in Cytoscape. A fairly consistent result in the analysis of coauthorship networks is that such networks feature a single, monstrously large connected component (containing ~50% of the nodes in the whole graph) and many smaller, peripheral components (such networks also tend to be scale-free). In this early time-slice, we found Baldwin in one of the peripheral components (fig. 1). Sliding the time-window forward one year (to 2000 - 2003), we noticed that Baldwin had moved from a small, disconnected component to the periphery of the major component (fig. 2). In subsequent years, Baldwin moved further into the core of the main component (fig. 3).
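The move from a peripheral component into the main component can also be checked programmatically. The following is a minimal sketch using NetworkX on invented toy records; `window_graph` and `in_largest_component` are hypothetical helper names, and the node "X" merely stands in for a focal author.

```python
from itertools import combinations

import networkx as nx

def window_graph(records, start, end):
    """Coauthorship graph for papers with start <= year <= end."""
    g = nx.Graph()
    for rec in records:
        if start <= rec["year"] <= end:
            g.add_nodes_from(rec["authors"])
            g.add_edges_from(combinations(rec["authors"], 2))
    return g

def in_largest_component(g, node):
    """True if `node` belongs to the largest connected component of g."""
    largest = max(nx.connected_components(g), key=len)
    return node in largest

# Toy records (invented): X starts out separate, then co-publishes with C.
records = [
    {"year": 1999, "authors": ["A", "B", "C"]},
    {"year": 2001, "authors": ["X", "Y"]},
    {"year": 2002, "authors": ["B", "C", "D"]},
    {"year": 2003, "authors": ["X", "C"]},
]

early = window_graph(records, 1999, 2002)  # X and Y form their own component
later = window_graph(records, 2000, 2003)  # X is now linked to C
```

In this toy example, `in_largest_component(early, "X")` is false and `in_largest_component(later, "X")` is true, mirroring the kind of shift seen between figures 1 and 2.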



Figure 1. Coauthorship network from the journal Plant Physiology, 1999 - 2002 (inclusive). Ian Baldwin is represented by a red node, and his coauthors are represented as blue nodes. The whole graph is shown at right, and the focal region is boxed.


Figure 2. Coauthorship network from the journal Plant Physiology, 2000 - 2003 (inclusive). Ian Baldwin is represented by a red node, and his coauthors are represented as blue nodes. The whole graph is shown at left, and the focal region is boxed. Baldwin is now part of the main component, situated near the periphery.


Figure 3. Coauthorship network from the journal Plant Physiology, 2007 - 2010 (inclusive). Ian Baldwin is represented by a red node, and his coauthors are represented as blue nodes. Baldwin is now embedded deep within the main component, with neighbors spread broadly throughout the graph.

 

So what's going on here?

Based solely on visual inspection it appears that, over time, Baldwin is becoming more deeply embedded in this collaborative network. Our impression is that he is assuming a more central, established position within this community of researchers. This increasing centrality might give Baldwin a more prominent role in controlling the flow of ideas, or influencing other researchers with his own ideas. But visual inspection can be misleading, especially when it comes to large hair-balls such as this one. What we really want is a metric that lets us express that intuition quantitatively, and ask more specific questions about Baldwin's apparent progressive introgression into this network.

The essence of our intuition about Baldwin's movement into the "core" of this network is that he is becoming more closely connected to an increasing number of researchers in this field. Graph theory gives us a useful concept for measuring this kind of centrality: closeness centrality. The closeness centrality of a node is based on the lengths of the shortest paths between that focal node and all other nodes in the network, and is usually calculated as:

( eq. 1 )

where i is the index of the focal node, and d_ij is the length of the shortest path between i and the jth node for all other nodes in the network.
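Read literally, this description (the inverse of the sum of shortest path lengths from the focal node) corresponds to the standard textbook form below. It is offered as a reconstruction from the surrounding prose; some definitions additionally multiply by N - 1, which does not change the ordering of nodes within one network.

```latex
C(i) = \frac{1}{\sum_{j \neq i} d_{ij}}
```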

Calculating closeness centrality in this way works well for networks with a single connected component. But as soon as a second component is introduced, the equation above starts to break down. If two nodes i and j belong to different components, then the shortest path between them is infinitely long. In this case, the denominator of eq. 1 is always infinity, and the closeness centrality of all nodes is effectively 0.
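The breakdown is easy to reproduce on a toy graph. This sketch assumes the plain inverse-of-sum form of closeness and assigns infinite distance to unreachable nodes, as described above; `classic_closeness` is a hypothetical name used only for this illustration.

```python
import math

import networkx as nx

# Toy graph with two components: a path a-b-c and a separate pair x-y.
g = nx.Graph([("a", "b"), ("b", "c"), ("x", "y")])

def classic_closeness(g, i):
    """Inverse of the summed shortest path lengths from i to every other node.

    Nodes outside i's component are given distance infinity.
    """
    # With a source given, NetworkX returns lengths only for reachable nodes.
    reachable = nx.shortest_path_length(g, source=i)
    total = sum(reachable.get(j, math.inf) for j in g if j != i)
    return 1.0 / total

values = {n: classic_closeness(g, n) for n in g}
```

Because every node has at least one unreachable partner, every sum is infinite and every value in `values` collapses to 0.0, so the metric no longer distinguishes between nodes.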

Tore Opsahl suggested a work-around for this problem in 2010, by summing the inverse shortest path lengths (rather than taking the inverse of the sum of shortest path lengths). Taking advantage of the fact that R (the statistical analysis package) interprets 1/infinity as 0, he calculated closeness centrality as:

( eq. 2 )
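Going by the description above (summing inverse shortest path lengths, with 1/infinity taken as 0 for nodes in other components), the workaround can be written as follows. This is a reconstruction from the prose, not a copy of the original equation.

```latex
C(i) = \sum_{j \neq i} \frac{1}{d_{ij}}, \qquad \frac{1}{\infty} \equiv 0
```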

Tethne is built on top of the Python package NetworkX, which provides a large library of methods for network analysis. Not wanting to reinvent the wheel, we started looking for a way to emulate Opsahl's workaround while still using NetworkX's methods to do the hard work. When provided with a source node only, the networkx.shortest_path_length() function returns the shortest path length to each node in the same component as that focal node. Since we're interested in the relative centrality of nodes, it wasn't important for us to arrive at precisely the same values as Opsahl -- just the same relative magnitudes of values between nodes. Here's the approach we settled on:

( eq. 3 )

where k is the index of the kth node in the set of nodes in the same component as focal node i, and N is the total number of nodes in the network (all components).
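One form consistent with this description, offered as a reconstruction rather than the original equation, sums inverse path lengths over the focal node's own component and scales by N:

```latex
C(i) = \frac{1}{N} \sum_{k \neq i} \frac{1}{d_{ik}}
```

Whether the original scales by N or by N - 1 cannot be recovered from the text; the relative ordering of nodes within one network is the same either way.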

This makes implementation in Python a breeze:
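A minimal sketch of such an implementation follows, assuming the metric is the sum of inverse shortest path lengths within the focal node's component, divided by the total number of nodes N. The function name `global_closeness` and the toy graph are illustrative assumptions, not Tethne's actual code.

```python
import networkx as nx

def global_closeness(g, i):
    """Sum of inverse shortest path lengths from i, divided by N.

    Only nodes in i's own component contribute: nodes in other
    components are absent from the result of nx.shortest_path_length,
    which has the same effect as adding 1/infinity = 0 for each of them.
    Dividing by N (all nodes, all components) is an assumption based
    on the description in the text.
    """
    lengths = nx.shortest_path_length(g, source=i)
    # Skip the focal node itself, which is reported with length 0.
    return sum(1.0 / d for d in lengths.values() if d > 0) / len(g)

# Toy graph with two components: a path a-b-c and a separate pair x-y.
g = nx.Graph([("a", "b"), ("b", "c"), ("x", "y")])
scores = {n: global_closeness(g, n) for n in g}
```

On this toy graph the middle node "b" scores highest, the ends of the path come next, and the isolated pair scores lowest, so nodes in small components are ranked below nodes in larger ones instead of everything collapsing to 0.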

 

A first attempt

To see whether this metric reflected our intuition about Baldwin's progressive introgression into the Plant Physiology coauthorship network, we calculated global closeness centrality just as before: using a sliding 4-year time-window, starting in 1999. Just as we believed we had observed, Baldwin's closeness centrality rose rapidly between the first and second time-window (moving from a small component to the periphery of the main component), and grew steadily over time (Fig. 4). Since the size of the network fluctuated somewhat between time-windows, and we assumed that the overall structure of the network might change over time, we also calculated the average global closeness centrality in each time-window (fig. 5). Normalizing Baldwin's global closeness centrality in each time-window makes for more defensible comparisons over time.
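The sliding-window comparison can be sketched as follows. The text does not spell out how the normalization was done; dividing a node's value by the mean over all nodes in the same window is an assumption here, and the centrality numbers are invented rather than taken from the study.

```python
from statistics import mean

# 4-year inclusive windows: (1999, 2002), (2000, 2003), ..., (2010, 2013).
windows = [(start, start + 3) for start in range(1999, 2011)]

def normalize_by_window_mean(centrality_by_window, node):
    """Divide a node's centrality by the average centrality of its window."""
    out = {}
    for window, scores in centrality_by_window.items():
        if node in scores:  # the node may be absent from some windows
            out[window] = scores[node] / mean(scores.values())
    return out

# Invented values for two windows (not the study's data).
centrality_by_window = {
    (1999, 2002): {"focal": 0.1, "p": 0.3, "q": 0.2},
    (2000, 2003): {"focal": 0.4, "p": 0.2, "q": 0.3},
}
trajectory = normalize_by_window_mean(centrality_by_window, "focal")
```

A normalized value above 1 then means the node is more central than the average node in that window, which makes values from windows of different sizes easier to compare.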



Figure 4. Global closeness centrality of Ian T. Baldwin in a coauthorship network from the journal Plant Physiology, 1999 - 2013, using a 4-year sliding time window.


Figure 5. Normalized global closeness centrality of Ian T. Baldwin in a coauthorship network from the journal Plant Physiology, 1999 - 2013, using a 4-year sliding time window.

 

Figures 4 and 5 indicate that by the second time-window Baldwin's level of global closeness centrality was far above average for the Plant Physiology coauthorship network, and grew over time. What these figures do not reveal, however, is how typical (or not) Baldwin's overall trajectory is for that network. In other words, we don't know how often it is the case that scientists show a steady increase in closeness centrality over time. Figure 6 shows a first attempt at comparing Baldwin's trajectory to those of other nodes in the network.



Figure 6. Normalized global closeness centrality of Ian T. Baldwin and twenty other randomly-selected researchers in a coauthorship network from the journal Plant Physiology, 1999 - 2013, using a 4-year sliding time window.

 

Clearly there is a great deal of variation in these trajectories. What we would like to do now is find a simple way to characterize the overall trend in a node's global closeness over time. Some nodes move briefly into a position of centrality, and then fall away just as rapidly -- these are likely scientists who publish only occasionally in this journal, and therefore only occasionally show up in the coauthorship network. This points to one of the limitations of using coauthorship to estimate actual collaborative networks. It is rare that a scientist publishes in one journal only. To make robust inferences about the structural properties of nodes in coauthorship-based social networks, we need a theory that connects journal choice to the underlying socio-disciplinary space that we are trying to analyze. Short of that, simply increasing the number of journals used in the analysis will increase our confidence in the interpretability of results like the ones discussed here.

We also added a method to Tethne for conducting an analysis like the one described above.
