Exploring Data Learning With Public Medical Databases
Tracking Herd Mentality In Citation Graphs of Psychiatry Research Papers
One method of ranking scientific papers and their authors is by comparing counts of other published papers that cite them. The material for this analysis is readily available through PubMed.
Tracking influence in this way ascribes a higher value to papers that have many citations, and an even higher value to those with numerous "grandchild" citations, that is, papers whose citing "child" papers are themselves heavily cited. Ordered this way, some articles may mark the start of a new field or method of research: influence on the influencers.
My hypothesis is that frequently cited papers, and their authors, in a form of psychiatry that gains significant traction but is later overturned will display a recognizable signature in their citation graphs, and that this pattern may be exposed through standard classification schemes.
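As a minimal sketch of this kind of scoring, the snippet below counts direct and "grandchild" citations from a small in-memory graph. The function name `citation_scores`, the `grandchild_weight` parameter, and the way the two counts are combined are assumptions made for illustration, not an established ranking formula.

```python
from collections import defaultdict

def citation_scores(cites, grandchild_weight=0.5):
    """cites maps a paper ID to the list of paper IDs it cites.

    Returns, per paper, the direct citation count, the "grandchild"
    count, and a combined score. The weighting is an arbitrary choice.
    """
    # Invert the graph: for each paper, which papers cite it?
    cited_by = defaultdict(set)
    for paper, references in cites.items():
        for ref in references:
            cited_by[ref].add(paper)

    papers = set(cites) | set(cited_by)
    scores = {}
    for paper in papers:
        children = cited_by.get(paper, set())
        # Grandchild citations: citations received by the citing papers.
        grandchildren = sum(len(cited_by.get(child, set())) for child in children)
        scores[paper] = {
            "direct": len(children),
            "grandchild": grandchildren,
            "score": len(children) + grandchild_weight * grandchildren,
        }
    return scores

# Toy graph: B and C cite A; D and E cite B; F cites C.
example = {"B": ["A"], "C": ["A"], "D": ["B"], "E": ["B"], "F": ["C"]}
scores = citation_scores(example)
print(scores["A"])
```

In this toy graph, paper A has fewer readers than a raw count would suggest is impressive, but its citing papers are themselves cited, which is what the grandchild term is meant to capture.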
In other words, distinct and recognizable paths emerge. Changes may signify a new drug, and new drugs may trigger patterns in the literature. A particular lab may become prominent, or dormant, or a small group of scientists may quickly come to press with many related applications of a novel technique.
The first step is to generate a graph of the citations. Beginning with a broad search somewhere in the middle of the time range, for each paper in the corpus, and then for each paper it cites, locate or create the corresponding entry. Increment the citation count and store the source or journal name, author list, publication date, and the papers cited. Each of the cited papers is then examined recursively in like manner.
For each paper traversed in this graph, maintain values and update them according to each new paper analyzed. Note that when a previously encountered entry returns to focus, we should find an existing, resolved record tree.
After a set limit of node traversals has been reached, we continue with another broad sample from a somewhat later date. Once an upper bound of practical computability is reached, the gathering phase is halted.
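The gathering phase described above might be sketched as follows. This version swaps the recursive descent for an explicit queue (a breadth-first walk), which avoids deep recursion on long citation chains. The `fetch_record` callable and the record fields are hypothetical stand-ins for whatever actually retrieves a paper's metadata; here a small dictionary plays that role so the example runs offline.

```python
from collections import deque

def crawl_citations(seed_ids, fetch_record, max_nodes=1000, graph=None):
    """Walk cited papers breadth-first, resolving at most max_nodes records.

    fetch_record is a hypothetical callable: paper ID -> dict with
    "journal", "authors", "date" and "cites" (list of cited paper IDs).
    Pass an existing graph to continue from an earlier sample.
    """
    graph = {} if graph is None else graph
    queue = deque(seed_ids)
    resolved_now = 0
    while queue and resolved_now < max_nodes:
        paper_id = queue.popleft()
        entry = graph.setdefault(paper_id, {"citation_count": 0})
        if entry.get("resolved"):
            continue  # previously encountered: the record already exists
        record = fetch_record(paper_id)
        entry.update(
            journal=record["journal"],
            authors=record["authors"],
            date=record["date"],
            cites=list(record["cites"]),
            resolved=True,
        )
        resolved_now += 1
        for cited_id in record["cites"]:
            cited = graph.setdefault(cited_id, {"citation_count": 0})
            cited["citation_count"] += 1
            if not cited.get("resolved"):
                queue.append(cited_id)
    return graph

# Made-up records standing in for real lookups.
FAKE = {
    "p1": {"journal": "J1", "authors": ["A"], "date": "1995-01-01", "cites": ["p2", "p3"]},
    "p2": {"journal": "J2", "authors": ["B"], "date": "1990-01-01", "cites": ["p3"]},
    "p3": {"journal": "J3", "authors": ["C"], "date": "1985-01-01", "cites": []},
}
graph = crawl_citations(["p1"], FAKE.__getitem__, max_nodes=10)
partial = crawl_citations(["p1"], FAKE.__getitem__, max_nodes=1)
print({pid: entry["citation_count"] for pid, entry in graph.items()})
```

Calling the function again with `graph=graph` and seeds from a later date range would continue the same structure, which is one way to implement the repeated sampling.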
We prepare counts to compute summary statistics. The program will cross-sort the papers by attribute information: populate foreign key tables and sort by attribute values. Map 'paper' identifiers onto their attribute values in an ordered, indexed lookup table.
- To normalize with historical periods, group papers with others of their era in buckets by publication date ranges.
- Prepare scores for pattern analysis such as the total number of citations, counts of second or nth level sub-citations, ratios of these, and so on.
- Another computed attribute to consider is a measure of "velocity", defined as the number of citations by other papers in a given time frame. The time frame for each paper runs from its publication date until its citation rate falls below some threshold.
- Various combinations and permutations of these sorted lists are computed. A paper's percentile placement within its bucket, for a given sort order, will (or will not) signal a potential classification or cluster.
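The bucketing, velocity, and percentile steps in the list above can be sketched as below. The five-year bucket width, the per-year normalization of velocity, and the definition of percentile as the share of same-era papers with a strictly lower value are all assumptions chosen for illustration; the function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date

def era_bucket(pub_date, width_years=5):
    """Return the (first_year, last_year) bucket containing pub_date."""
    start = pub_date.year - pub_date.year % width_years
    return (start, start + width_years - 1)

def velocity(citing_dates, start, end):
    """Citations per year received in the window [start, end)."""
    count = sum(1 for d in citing_dates if start <= d < end)
    years = (end - start).days / 365.25
    return count / years

def percentiles_within_era(papers, key):
    """papers: paper ID -> {"pub_date": date, key: number}.

    Returns paper ID -> percentage of papers in the same era bucket
    whose value for key is strictly lower.
    """
    by_era = defaultdict(list)
    for pid, info in papers.items():
        by_era[era_bucket(info["pub_date"])].append(pid)
    result = {}
    for pids in by_era.values():
        values = sorted(papers[p][key] for p in pids)
        for p in pids:
            below = bisect_left(values, papers[p][key])
            result[p] = 100.0 * below / len(values)
    return result

# Made-up data for illustration only.
papers = {
    "a": {"pub_date": date(1991, 3, 1), "citations": 10},
    "b": {"pub_date": date(1993, 7, 1), "citations": 30},
    "c": {"pub_date": date(1994, 1, 1), "citations": 20},
    "d": {"pub_date": date(2001, 5, 1), "citations": 5},
}
ranks = percentiles_within_era(papers, "citations")
v = velocity(
    [date(1992, 6, 1), date(1993, 6, 1), date(1999, 1, 1)],
    start=date(1992, 1, 1),
    end=date(1994, 1, 1),
)
print(ranks, round(v, 2))
```

Ranking within an era rather than across the whole corpus keeps older papers, which have had longer to accumulate citations, from dominating every list.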
How many papers rank higher than average for total citations in their time duration? How many of these have a relatively high number of first-degree citations, but relatively few second-degree citations? How much of this depends on institutional affiliation or authorship?
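One hedged way to turn the second question into a filter is sketched below: flag papers whose first-degree count is above the corpus average while their second-degree count stays below a chosen multiple of it. The function name, the `ratio_threshold` parameter, and the sample numbers are hypothetical, and the comparison ignores era bucketing for brevity.

```python
from statistics import mean

def flag_shallow_influence(stats, ratio_threshold=1.0):
    """stats: paper ID -> {"direct": int, "second": int}.

    Flags papers with above-average direct citations whose
    second-degree citations fall below ratio_threshold * direct.
    """
    avg_direct = mean(s["direct"] for s in stats.values())
    flagged = []
    for pid, s in stats.items():
        many_direct = s["direct"] > avg_direct
        few_second = s["second"] < ratio_threshold * s["direct"]
        if many_direct and few_second:
            flagged.append(pid)
    return sorted(flagged)

# Made-up counts for illustration only.
stats = {
    "a": {"direct": 40, "second": 10},   # widely cited, little downstream uptake
    "b": {"direct": 40, "second": 200},  # widely cited, strong downstream uptake
    "c": {"direct": 5, "second": 2},
    "d": {"direct": 3, "second": 0},
}
flagged = flag_shallow_influence(stats)
print(flagged)
```

Whether such a flag actually corresponds to later-overturned work is the hypothesis to be tested, not something this filter establishes.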
- "Schizophrenia: A brother finds answers in biological science" by Ronald Chase; https://www.amazon.com/Schizophrenia-Brother-Answers-Biological-Science/dp/1421410915 (inspired by pages 1-3 of the prologue)
Hacking On the PubMed API, Trotter: http://www.fredtrotter.com/2014/11/14/hacking-on-the-pubmed-api/
PubMed Entrez API (national science research database)
Open programmatic access
Some Machine Learning and NLP resources:
Practical advice for analysis of large, complex data sets: unofficialgoogle datascience/2016/10/practical-advice-for-analysis-of-large.html
Natural Language Toolkit: http://www.nltk.org/
Vector Semantics and Embeddings: https://web.stanford.edu/~jurafsky/slp3/6.pdf
Choosing the right estimator: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Taming Text: manning-content.s3.amazonaws.com /download/ /Sample-ch01.pdf
pubmed.mineR: Text Mining of PubMed Abstracts: https://cran.r-project.org/web/packages/pubmed.mineR/index.html
An example search call to the PubMed database from Trotter's article, returning JSON that lists the numeric identifiers of up to 1000 journal articles related to breast cancer:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=1000&term=("breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]) AND (Review[ptyp] AND jsubsetaim[text])
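A call like the one above can be assembled programmatically so that the search term is percent-encoded correctly. The sketch below uses only the Python standard library; the helper names `build_esearch_url` and `fetch_ids` are hypothetical, the term is a shortened illustrative query in PubMed's field-tag syntax, and the `esearchresult` / `idlist` keys reflect the JSON layout the esearch endpoint returns for `retmode=json`. The network call is defined but not executed here.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=1000):
    """Build an esearch URL; urlencode handles spaces, quotes and brackets."""
    params = {"db": "pubmed", "retmode": "json", "retmax": retmax, "term": term}
    return ESEARCH + "?" + urlencode(params)

def fetch_ids(url):
    """Fetch the URL and return the list of PubMed IDs (requires network)."""
    with urlopen(url) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]

# Shortened illustrative query, not the full term from the article.
term = '"breast cancer"[All Fields] AND Review[ptyp]'
url = build_esearch_url(term)
query = parse_qs(urlsplit(url).query)
print(url)
```

Calling `fetch_ids(url)` would then yield identifier strings that can seed the citation graph.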
© 2017 Curtis W. Moore