Semi-automation of large scale scientific literature review
About the project
We have collaborated with people from The Nature Conservancy to accelerate their research on Natural Climate Solutions.
Location:
United States
Industry:
Climate research
Services:
Research acceleration
Business type:
University / Non-Profit
Challenges
Evidence maps are important for guiding environmental action, such as using NCS to mitigate climate change. The problem is that the default approach to building evidence maps is highly manual: no human team can review the millions of potentially relevant papers, and even a large research group might need a decade to read them all. On the other hand, a fully automated machine learning approach is also unsuitable, because no pre-existing, clean, robust labeled dataset exists.
Solution
Working with the Client, Lexunit co-developed a semi-supervised approach that uses machine learning to filter and organize the data in a principled way and then applies expert review to a representative sample of the data.
A 5-step pipeline was created that combines machine learning and human review:
- Search and data collection
- Pre-filter articles
- Topic modeling
- Expert review
- Extract variables
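The five steps above can be thought of as a composed pipeline. The sketch below is purely illustrative: every stage is a toy stub standing in for the real component, and all function names and data are invented for the example.

```python
def search_and_collect(queries):
    """Stage 1: query bibliographic databases (stubbed with toy records)."""
    return [{"id": i, "abstract": q} for i, q in enumerate(queries)]

def pre_filter(papers):
    """Stage 2: keep papers a relevance classifier would accept (toy rule)."""
    return [p for p in papers if "forest" in p["abstract"]]

def topic_model(papers):
    """Stage 3: assign each paper a topic label (toy rule)."""
    for p in papers:
        p["topic"] = "reforestation" if "forest" in p["abstract"] else "other"
    return papers

def expert_review(papers):
    """Stage 4: experts validate topic labels on a sample (stubbed no-op)."""
    return papers

def extract_variables(papers):
    """Stage 5: pull structured variables from each abstract (toy metric)."""
    for p in papers:
        p["n_words"] = len(p["abstract"].split())
    return papers

def run_pipeline(queries):
    papers = search_and_collect(queries)
    papers = pre_filter(papers)
    papers = topic_model(papers)
    papers = expert_review(papers)
    return extract_variables(papers)
```

The value of this shape is that each stage can be re-run independently when the pipeline is repeated on new literature.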
The first step was to test search strings for extracting data from Web of Science and Scopus, which yielded almost 2.3 million unique papers. These articles were then pre-filtered using machine learning, leaving 1.28M papers. After a thorough investigation, the Client and Lexunit decided that topic modeling, an unsupervised machine learning approach, was the most suitable solution for the problem at hand.
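The case study does not specify which classifier performed the pre-filtering, so the sketch below substitutes a common stand-in: TF-IDF features with logistic regression trained on a small labeled seed set. All abstracts and labels here are invented for illustration.

```python
# Assumed stand-in for the pre-filtering step: TF-IDF + logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

seed_abstracts = [
    "reforestation increases soil carbon sequestration",
    "mangrove restoration protects coastal communities",
    "wetland conservation stores carbon and supports biodiversity",
    "smartphone sales grew in the fourth quarter",
    "new compiler optimizations reduce binary size",
    "stock market volatility and interest rates",
]
seed_labels = [1, 1, 1, 0, 0, 0]  # 1 = relevant to NCS, 0 = irrelevant

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(seed_abstracts, seed_labels)

candidates = [
    "forest restoration and carbon storage in tropical soils",
    "quarterly smartphone revenue forecast",
]
kept = [a for a in candidates if clf.predict([a])[0] == 1]
```

At the project's scale, the same idea applies: a classifier trained on a seed set scores millions of abstracts, and only those predicted relevant proceed to topic modeling.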
First, we transformed the abstracts of the 1.28M papers from text into numeric vectors using large language models (LLMs), and then topic modeling grouped the documents into clusters. This provided a structured way to look at the data and made expert review feasible. Natural Language Processing (NLP) was employed to extract the top 40 keywords and keyphrases from the documents belonging to each topic, and the 10 most representative papers were selected. The team then worked together to categorize topics into NCS and co-benefits. Lexunit then analyzed these papers and extracted several variables from them.
Named entities were identified in the abstract text and geolocated, placing each paper on a map (see figure above). The geolocations were then intersected with the IUCN Global Ecosystem Typology polygons to determine each paper's biome as well. To extract biodiversity information, the abstracts were checked against all Latin binomials from the Open Tree of Life database, which contains more than 4M entities.
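The binomial check can be sketched as a regex for "Genus species" patterns filtered against a known-species set. The species names and the tiny set below are illustrative stand-ins; the real check ran against the 4M+ names in Open Tree of Life.

```python
import re

# Toy stand-in for the Open Tree of Life name list (4M+ entries in reality).
known_binomials = {"Rhizophora mangle", "Avicennia germinans", "Quercus robur"}

# Candidate pattern: capitalized genus followed by a lowercase species epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{2,})\b")

def find_species(abstract):
    """Return binomials mentioned in the abstract that appear in the set."""
    return sorted(set(BINOMIAL.findall(abstract)) & known_binomials)

text = ("Mangrove stands dominated by Rhizophora mangle and "
        "Avicennia germinans store blue carbon.")
```

Filtering against the authoritative name list is what keeps false positives (e.g. an ordinary capitalized phrase like "Mangrove stands") out of the extracted variables.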
Additionally, cost information was extracted, and papers related to indigenous peoples and local communities were identified.
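A first pass at cost extraction could look like the pattern below. This regex and the sample sentence are assumptions for illustration only; the project's actual extraction logic is not described in the case study.

```python
import re

# Illustrative pattern for dollar cost mentions such as "$120" or
# "USD 4.5 million"; real-world extraction needs far more cases.
COST = re.compile(r"(?:\$|USD\s?)\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion))?")

def find_costs(text):
    """Return raw monetary mentions found in the text."""
    return COST.findall(text)

sentence = "Restoration cost $120 per hectare, totaling USD 4.5 million."
```

Captured strings would then be normalized (currency, units, per-hectare vs. total) before landing in the database.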
Results
The end result of our collaboration was
(i) a cloud database that contains 120+ data fields for each collected paper, including its metadata and machine-learning-extracted variables; and
(ii) an automatic, repeatable pipeline that reduces the time required to create an evidence map from years to weeks.
The data processing pipeline is connected to the database and can be used to create an entirely new evidence map or to update the existing one. Since the Client's goal is to periodically update the evidence map with the latest literature, this solution offers them great value. Furthermore, the expert review phase can be skipped on later runs, because the pipeline is built so that it can reuse past knowledge to determine the topics and categories of new data.
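Reusing past knowledge for new papers can be sketched as nearest-centroid matching in embedding space: each topic keeps a stored centroid from the reviewed run, and a new paper is assigned to the closest one. The centroids and 3-d vectors below are toy values, not project data.

```python
import numpy as np

# Stored topic centroids from a previous, expert-reviewed run
# (toy 3-d vectors; the real ones would be LLM embedding vectors).
centroids = {
    "reforestation": np.array([0.9, 0.1, 0.0]),
    "blue_carbon":   np.array([0.1, 0.9, 0.1]),
}

def assign_topic(embedding):
    """Assign a new paper to the closest stored centroid (cosine similarity)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(centroids, key=lambda t: cos(centroids[t], embedding))
```

Because the expert-validated topic structure is frozen in the centroids, newly collected papers inherit topics and categories without another round of manual review.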
Finally, using our solution, the Client was able to conduct a large-scale review of the scientific literature on NCS and their human well-being and biodiversity co-benefits. Based on their findings and our methodology, together we drafted a manuscript that is currently under review at top scientific journals.