Madeleine Gregory, AI Trained on Old Scientific Papers Makes Discoveries Humans Missed, Vice, July 9, 2019:
Using just the language in millions of old scientific papers, a machine learning algorithm was able to make completely new scientific discoveries.
In a study published in Nature on July 3, researchers from the Lawrence Berkeley National Laboratory used an algorithm called Word2Vec to sift through scientific papers for connections humans had missed. Their algorithm then spit out predictions for possible thermoelectric materials, which convert heat to electricity and are used in many heating and cooling applications.
The algorithm didn’t know the definition of thermoelectric, though. It received no training in materials science. Using only word associations, the algorithm was able to provide candidates for future thermoelectric materials, some of which may be better than those we currently use.
“It can read any paper on material science, so can make connections that no scientists could,” researcher Anubhav Jain said. “Sometimes it does what a researcher would do; other times it makes these cross-discipline associations.”
To train the algorithm, the researchers assessed the language in 3.3 million abstracts related to materials science, ending up with a vocabulary of about 500,000 words. They fed the abstracts to Word2Vec, which used machine learning to analyze relationships between words.
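To make the idea of learning from word associations concrete, here is a minimal sketch in Python. It is not Word2Vec, which trains a small neural network. As a simpler stand-in it counts which words appear near each other and compares the resulting count vectors. The four toy "abstracts", the window size, and the function names are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
import math

# Toy stand-ins for abstracts (invented for illustration; the study used 3.3 million).
abstracts = [
    "bi2te3 is a thermoelectric material with high zt",
    "pbte is a thermoelectric material with low thermal conductivity",
    "licoo2 is a cathode material for batteries",
    "lifepo4 is a cathode material for batteries",
]

def cooccurrence_vectors(texts, window=4):
    """Map each word to a Counter of the words seen within `window` tokens of it."""
    vectors = defaultdict(Counter)
    for text in texts:
        tokens = text.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = cooccurrence_vectors(abstracts)

# Two compounds described in similar contexts end up with similar vectors.
sim_same = cosine(vectors["bi2te3"], vectors["pbte"])
sim_diff = cosine(vectors["bi2te3"], vectors["licoo2"])
print(sim_same > sim_diff)
```

In this toy corpus the two compounds described as thermoelectric share more context words with each other than with the cathode compound, so their vectors are closer. No definition of "thermoelectric" is supplied anywhere.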
The original research article:
Vahe Tshitoyan, et al., Unsupervised word embeddings capture latent knowledge from materials science literature, Nature, volume 571, pages 95–98 (2019):
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
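One way to picture how embeddings could recommend materials for an application is to rank material names by how close their vectors lie to the vector of an application word such as "thermoelectric". The sketch below assumes cosine similarity as the measure of closeness. The three-dimensional vectors are invented for illustration and are not taken from the paper, and `rank_candidates` is a hypothetical helper name.

```python
import math

# Hypothetical 3-dimensional embeddings; real embeddings are learned from text
# and have far more dimensions. All numbers here are made up.
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "Bi2Te3": [0.8, 0.2, 0.1],
    "PbTe": [0.7, 0.1, 0.4],
    "LiCoO2": [0.1, 0.9, 0.2],
    "NaCl": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def rank_candidates(target, candidates, table):
    """Return (name, similarity) pairs, most similar to `target` first."""
    scored = [(name, cosine(table[target], table[name])) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = rank_candidates(
    "thermoelectric", ["LiCoO2", "PbTe", "NaCl", "Bi2Te3"], embeddings
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these made-up vectors, the names placed nearest the application word rank first. In a real setting the interesting candidates would be materials that rank highly even though they never appear alongside the application word in the text.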