Exploring Data Learning With Public Medical Databases

Tracking Herd Mentality In Citation Graphs of Psychiatry Research Papers

Curtis W. Moore

School of Janitorial Sciences and Sandwich Making
November, 2016
Madison, WI

One method of ranking scientific papers and their authors is by comparing counts of other published papers that cite them. The material for this analysis is readily available through PubMed.

Tracking influence in this way ascribes a higher value to papers that have many citations, and an even higher value to those with numerous "grandchild" citations: papers whose citing "child" papers are themselves cited to a high degree. Ordered in this way, some articles may mark the start of a new field or method of research, that is, influence on influencers.
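A minimal sketch of the first-degree and "grandchild" counts described above, in Python. The input format (a dictionary mapping each paper to the papers it cites) and the function name `citation_degrees` are assumptions made for illustration. Here a grandchild count is simply the sum of the citation counts of the citing papers, so a paper that cites two children is counted twice.

```python
from collections import defaultdict

def citation_degrees(cites):
    """Return {paper: (first_degree, second_degree)} citation counts.

    `cites` maps each paper id to the ids it cites (a hypothetical toy format).
    """
    # Invert the graph: for each paper, which papers cite it?
    cited_by = defaultdict(set)
    for paper, refs in cites.items():
        for ref in refs:
            cited_by[ref].add(paper)

    degrees = {}
    for paper in set(cites) | set(cited_by):
        children = cited_by.get(paper, set())
        # Second degree: citations received by the papers that cite this one.
        grandchildren = sum(len(cited_by.get(child, set())) for child in children)
        degrees[paper] = (len(children), grandchildren)
    return degrees

# Toy graph: B and C cite A; D and E cite B; F cites C.
toy = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B"], "F": ["C"]}
degrees = citation_degrees(toy)
print(degrees["A"])  # (2, 3)
```

In the toy graph, paper A has two direct citations and three grandchild citations, so it would rank above B, which has two direct citations but none at the second degree.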

My hypothesis is that frequently cited papers, and their authors, belonging to a school of psychiatry that gains significant traction but is later overturned, will display a recognizable signature in their citation graphs, and that this pattern may be exposed through standard classification schemes.

In other words, distinct and recognizable paths emerge. Changes may signify a new drug, and new drugs may trigger patterns in the literature. A particular lab may become prominent, or dormant, or a small group of scientists may quickly come to press with many related applications of a novel technique.

The first step is to generate a graph of the citations. Beginning with a broad search somewhere in the middle of the time range, for each paper in the corpus, and then for each paper it cites, locate or create the corresponding entry. Increment the citation count and store the source or journal name, author list, publication date, and the papers cited. Each cited paper is then examined recursively in the same manner.

For each paper traversed in this graph, maintain values and update them according to each new paper analyzed. Note that when a previously encountered entry returns to focus, we should find its existing, resolved record tree rather than traverse it again.

After a set limit of node traversals is reached, we continue with another broad sample from a somewhat later date. Once an upper bound of practical computability is reached, the gathering phase halts.
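The gathering phase might be sketched as follows. This is an illustration under stated assumptions, not a finished crawler: `fetch_record` is a hypothetical callable standing in for a real PubMed lookup, the record fields are invented for the example, and the traversal uses an explicit queue (breadth-first) in place of literal recursion so that a node budget is easy to enforce.

```python
from collections import deque

def crawl(seeds, fetch_record, records=None, max_nodes=1000):
    """Expand the citation graph from `seeds`, resolving at most `max_nodes` papers.

    `fetch_record(pmid)` is a hypothetical lookup returning a dict with
    'journal', 'authors', 'date' and 'cites' (a list of cited paper ids).
    Passing an existing `records` dict continues an earlier crawl.
    """
    records = {} if records is None else records
    queue = deque(seeds)
    visited = 0
    while queue and visited < max_nodes:
        pmid = queue.popleft()
        if records.get(pmid, {}).get("resolved"):
            continue  # already expanded: reuse the existing record
        meta = fetch_record(pmid)
        entry = records.setdefault(pmid, {"citations": 0})
        entry.update(meta, resolved=True)
        visited += 1
        for cited in dict.fromkeys(meta["cites"]):  # drop duplicate references
            target = records.setdefault(cited, {"citations": 0, "resolved": False})
            target["citations"] += 1
            queue.append(cited)
    return records

# Toy corpus standing in for the real database.
TOY = {
    "P1": {"journal": "J1", "authors": ["Kim"], "date": "2005", "cites": ["P2", "P3"]},
    "P2": {"journal": "J1", "authors": ["Lee"], "date": "2003", "cites": ["P3"]},
    "P3": {"journal": "J2", "authors": ["Roy"], "date": "2001", "cites": []},
}
records = crawl(["P1"], TOY.__getitem__, max_nodes=10)
partial = crawl(["P1"], TOY.__getitem__, max_nodes=1)
```

With a budget of one node, only P1 is resolved and P2 and P3 remain as unresolved stubs that already carry a citation count. A later call with new seeds and the same `records` dictionary picks up where the earlier one stopped.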

We prepare counts to compute summary statistics. The program will cross-sort the papers by attribute information: populate foreign key tables and sort by attribute values. Map 'paper' identifiers onto their attribute values in an ordered, indexed lookup table.
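One way to picture the lookup tables is an inverted index from attribute values to paper identifiers. The sketch below assumes records are held in a dictionary keyed by paper id. The function name `build_index` and the field names are hypothetical, and a real implementation might use database tables with foreign keys instead.

```python
from collections import defaultdict

def build_index(records, attribute):
    """Map each value of `attribute` to a sorted list of paper ids that carry it."""
    index = defaultdict(list)
    for pmid, record in records.items():
        values = record.get(attribute)
        if values is None:
            continue
        # Treat scalar attributes (e.g. journal) like one-element lists (e.g. authors).
        if not isinstance(values, (list, tuple, set)):
            values = [values]
        for value in values:
            index[value].append(pmid)
    # Order the table by attribute value, and the ids within each entry.
    return {value: sorted(ids) for value, ids in sorted(index.items())}

sample = {
    "P2": {"journal": "J1", "authors": ["Lee", "Kim"]},
    "P1": {"journal": "J1", "authors": ["Kim"]},
    "P3": {"journal": "J0", "authors": []},
}
by_journal = build_index(sample, "journal")
by_author = build_index(sample, "authors")
print(by_journal)  # {'J0': ['P3'], 'J1': ['P1', 'P2']}
```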

How many papers rank higher than average for total citations in their time period? How many of these have a relatively high number of first-degree citations, but relatively few second-degree citations? How much of this is attributable to institutional affiliation or to the authors?
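The first two questions can be phrased as a simple filter. The sketch below groups papers by publication year and flags those above their year's average in first-degree citations but below it in second-degree citations. The use of the per-year mean as the threshold, the field names, and the function name `flag_shallow_influence` are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean

def flag_shallow_influence(papers):
    """Return ids of papers with above-average first-degree citations
    but below-average second-degree citations within their year."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["year"]].append(paper)

    flagged = []
    for group in by_year.values():
        avg_first = mean(p["first"] for p in group)
        avg_second = mean(p["second"] for p in group)
        for p in group:
            if p["first"] > avg_first and p["second"] < avg_second:
                flagged.append(p["id"])
    return sorted(flagged)

papers = [
    {"id": "a", "year": 2000, "first": 10, "second": 2},
    {"id": "b", "year": 2000, "first": 4, "second": 30},
    {"id": "c", "year": 2000, "first": 1, "second": 1},
    {"id": "d", "year": 2001, "first": 3, "second": 3},
]
flagged = flag_shallow_influence(papers)
print(flagged)  # ['a']
```

Paper "a" is heavily cited directly but its citing papers attract little attention of their own, which is the kind of profile the hypothesis is looking for.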


  1. "Schizophrenia: A Brother Finds Answers in Biological Science", Ronald Chase; https://www.amazon.com/Schizophrenia-Brother-Answers-Biological-Science/dp/1421410915 (inspired by pages 1-3 of the prologue)
  2. "Hacking On the PubMed API", Trotter; http://www.fredtrotter.com/2014/11/14/hacking-on-the-pubmed-api/
  3. PubMed Entrez API (national science research database with open programmatic access)

Appendix A:

Some Machine Learning and NLP resources:

Appendix B:

An example search call to the PubMed database from Trotter's article, in text/JSON listing the numeric identifiers of 1000 journal articles related to breast cancer:
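A minimal sketch of how such a request could be built, assuming the NCBI E-utilities `esearch` endpoint with the `db`, `term`, `retmode` and `retmax` parameters. It is an illustration of the shape of the call, not a verbatim copy of the one in Trotter's article, and the helper name `esearch_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retmax=1000):
    """Build a PubMed search URL that asks for up to `retmax` ids as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

url = esearch_url("breast cancer")
print(url)
```

Fetching this URL (for example with `urllib.request.urlopen`) should return a JSON document. To the best of my knowledge the identifiers appear under `esearchresult` and `idlist`, but that layout should be checked against a live response.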


© 2016 Curtis W. Moore, techbio