Amir

Reputation: 181

Extract abstract / full text from scientific literature given DOI or Title

There are quite a lot of tools to extract text from PDF files [1-4]. However, the problem with most scientific papers is that it is hard to access the PDF directly, mostly because of paywalls. There are also tools that provide easy access to a paper's information, such as its metadata or BibTeX entry [5-6]. What I want is to take a step further and go beyond just the BibTeX/metadata:

Assuming that there is no direct access to a publication's PDF file, is there any way to obtain at least the abstract of a scientific paper given the paper's DOI or title? In my search I found that there have been some attempts [7] at a similar goal. Does anyone know a website/tool that can help me obtain/extract the abstract or full text of scientific papers? If there are no such tools, can you give me some suggestions for how I should go about solving this problem?

Thank you

[1] http://stackoverflow.com/questions/1813427/extracting-information-from-pdfs-of-research-papers
[2] https://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf
[3] http://stackoverflow.com/questions/6731735/extracting-the-actual-in-text-title-from-a-pdf?lq=1
[4] http://stackoverflow.com/questions/14291856/extracting-article-contents-from-pdf-magazines?rq=1
[5] https://stackoverflow.com/questions/10507049/get-metadata-from-doi
[6] https://github.com/venthur/gscholar
[7] https://stackoverflow.com/questions/15768499/extract-text-from-google-scholar

Upvotes: 18

Views: 14455

Answers (5)

Paweł Kolendo

Reputation: 19

I have written a Python script that works in most cases. Sometimes there is a connection error, so get_abstract_from_doi should be called inside a try/except block.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
import requests

def main():
    url = "https://doi.org/10.1016/j.compstruc.2012.09.003"  # Specify the DOI here
    for _ in range(7):  # Retry, since connection errors happen occasionally
        try:
            print(get_abstract_from_doi(url))
            break
        except Exception:
            pass

def get_abstract_from_doi(doi):
    # Follow redirects from doi.org to the publisher's actual domain
    r = requests.get(doi, allow_redirects=True)

    # Set up the Selenium WebDriver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(r.url)

        # Find all elements whose text contains "Abstract:"/"abstract:",
        # regardless of nesting; fall back to plain "Abstract"/"abstract"
        elements = driver.find_elements(By.XPATH, '//*[contains(text(), "bstract:")]')
        if not elements:
            elements = driver.find_elements(By.XPATH, '//*[contains(text(), "bstract")]')

        # Keep only elements with non-empty text
        elements = [elem for elem in elements if elem.text.strip()]

        # Start from the shortest match (usually the "Abstract" heading itself),
        # then climb to parent elements until the accumulated text at least
        # doubles, which usually means the abstract body is now included
        element = min(elements, key=lambda x: len(x.text.strip()))
        characters = len(element.text.strip())
        while len(element.text.strip()) < 2 * characters:
            element = element.find_element(By.XPATH, "./..")
        abstract_text = element.text.strip()
    finally:
        driver.quit()

    return abstract_text

if __name__ == "__main__":
    main()

Upvotes: 0

Ferroao

Reputation: 3056

Using curl (works on my Linux):

curl http://api.crossref.org/works/10.1080/10260220290013453 2>&1  | # doi after works    
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g'      | # substitute paragraph tags 
sed 's/<[^>]*>/ /g'                          # remove other tags

# add "echo" to show unicode characters

echo -e $(curl http://api.crossref.org/works/10.1155/2016/3845247 2>&1  | # doi after works    
grep -o -P '(?<=abstract":").*?(?=","DOI)' | # get text between abstract":" and ","DOI
sed -E 's/<jats:p>|<\\\/jats:p>/\n/g'      | # substitute paragraph tags 
sed 's/<[^>]*>/ /g')                         # remove other tags
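The same lookup can also be sketched in Python using only the standard library (a sketch, assuming the Crossref works route returns an optional `abstract` field; the hypothetical `strip_jats` helper mirrors the two `sed` substitutions above):

```python
import json
import re
import urllib.request

def strip_jats(text):
    """Mirror the sed substitutions above: turn <jats:p> paragraph tags
    into newlines, then drop any remaining tags."""
    text = re.sub(r"</?jats:p>", "\n", text)
    return re.sub(r"<[^>]*>", " ", text).strip()

def crossref_abstract(doi):
    """Fetch the Crossref works record for a DOI and return its cleaned
    abstract, or None when the record carries no abstract (network call)."""
    with urllib.request.urlopen("https://api.crossref.org/works/" + doi) as resp:
        record = json.load(resp)["message"]
    abstract = record.get("abstract")
    return strip_jats(abstract) if abstract else None
```

For example, `crossref_abstract("10.1155/2016/3845247")` should give the same cleaned text as the second pipeline above.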

using R:

library(rcrossref)
cr_abstract(doi = '10.1109/TASC.2010.2088091')

Upvotes: 1

Randall

Reputation: 3044

Crossref may be worth checking. They allow members to include abstracts with the metadata, but it's optional, so coverage isn't comprehensive. When I asked their helpdesk, they said abstracts were available for around 450,000 registered DOIs as of June 2016.

If an abstract exists in their metadata, you can get it using their UNIXML format. Here's one specific example:

curl -LH "Accept:application/vnd.crossref.unixref+xml" http://dx.crossref.org/10.1155/2016/3845247
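The same request can be issued from Python and the abstract pulled out of the XML (a sketch with the standard library; the helper just looks for the first element whose namespaced tag ends in `abstract`, e.g. `jats:abstract`, since namespaces vary):

```python
import urllib.request
import xml.etree.ElementTree as ET

UNIXREF = "application/vnd.crossref.unixref+xml"

def fetch_unixref(doi):
    """Fetch a DOI's Crossref record in UNIXML format (network call)."""
    req = urllib.request.Request("http://dx.crossref.org/" + doi,
                                 headers={"Accept": UNIXREF})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

def abstract_from_unixref(xml_text):
    """Return the text of the first element whose (namespaced) tag ends
    in 'abstract', or None if the record has no abstract."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.tag.rsplit("}", 1)[-1] == "abstract":
            return " ".join(elem.itertext()).strip()
    return None
```

Usage would be `abstract_from_unixref(fetch_unixref("10.1155/2016/3845247"))`, returning None for records that carry no abstract.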

Upvotes: 2

Franck Dernoncourt

Reputation: 83387

If the article is on PubMed (which contains around 25 million documents), you can use the Python package Entrez to retrieve the abstract.
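As a minimal sketch, the NCBI E-utilities HTTP API (which the Entrez package wraps) can be called directly with the standard library; `efetch` with `rettype=abstract` and `retmode=text` returns the abstract as plain text:

```python
import urllib.parse
import urllib.request

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(pmid):
    """Build an efetch request URL that returns a PubMed record's
    abstract as plain text."""
    params = {"db": "pubmed", "id": pmid,
              "rettype": "abstract", "retmode": "text"}
    return EFETCH + "?" + urllib.parse.urlencode(params)

def pubmed_abstract(pmid):
    """Fetch the abstract for a given PubMed ID (network call)."""
    with urllib.request.urlopen(efetch_url(pmid)) as resp:
        return resp.read().decode("utf-8")
```

Note that this takes a PubMed ID rather than a DOI; you would first resolve the DOI/title to a PMID, e.g. via an `esearch` query.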

Upvotes: 1

mlee_jordan

Reputation: 842

You can have a look at Crossref's text and data mining (TDM) service (http://tdmsupport.crossref.org/). This organization provides a RESTful API for free, and more than 4,000 publishers contribute to the TDM service. You can find some examples at the link below:

https://github.com/CrossRef/rest-api-doc/blob/master/rest_api_tour.md

But to give a very simple example:

If you go to the link

http://api.crossref.org/works/10.1080/10260220290013453

you will see that, besides some basic metadata, there are two other fields, namely license and link: the former tells you under what kind of license the publication is provided, and the latter gives the URL of the full text. In our example, the license metadata shows a Creative Commons (CC) license, which means the publication is free to use for TDM purposes. By searching within Crossref for publications with CC licenses, you can access hundreds of thousands of publications along with their full texts. From my latest research, I can say that Hindawi is the most TDM-friendly publisher; they alone provide more than 100K publications with CC licenses. One last thing: full texts may be provided in either XML or PDF format. The XML versions are highly structured and therefore easy to extract data from.
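Reading those two fields out of the works record can be sketched with the standard library (the helper names are mine; both fields are optional in Crossref records, so they default to empty lists):

```python
import json
import urllib.request

def extract_license_and_links(message):
    """Pull the optional 'license' and 'link' fields out of a Crossref
    works record; either field may be absent, so default to empty."""
    licenses = [lic.get("URL") for lic in message.get("license", [])]
    links = [(ln.get("content-type"), ln.get("URL"))
             for ln in message.get("link", [])]
    return licenses, links

def fetch_works_record(doi):
    """Fetch the works record for a DOI from the Crossref REST API
    (network call)."""
    with urllib.request.urlopen("https://api.crossref.org/works/" + doi) as resp:
        return json.load(resp)["message"]
```

For the example above: `extract_license_and_links(fetch_works_record("10.1080/10260220290013453"))`.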

To sum it up, you can automatically access many full texts through the Crossref TDM service by employing their API and simply writing a GET request. If you have further questions, do not hesitate to ask.

Cheers.

Upvotes: 9
