tenebris silentio

Reputation: 519

Python - Beautiful Soup: Web scraping PubMed - extracting PMIDs (article IDs), adding them to a list, and preventing duplicate scraping

I want to extract research abstracts from PubMed. I will have multiple URLs to search for publications, and some of them will return the same articles as others. Each article has a unique ID called a PMID, and each abstract's URL is just a base URL plus the PMID (for example: https://pubmed.ncbi.nlm.nih.gov/ + 32663045). I don't want to extract the same article twice (e.g., it makes the whole script take longer to run and uses more bandwidth), so once I extract a PMID, I add it to a list. I'm trying to make my code extract the information from each abstract only once; however, my code is still extracting duplicate PMIDs and publication titles.

I know how to drop duplicates from my output in Pandas, but that's not what I want to do. I want to skip over PMIDs/URLs that I have already scraped.
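
In other words, the skip logic I'm after behaves roughly like this (just a sketch, with the two PMIDs from the output below hard-coded; `seen` and `pmids_from_searches` are placeholder names):

seen = set()
# duplicate PMIDs, as returned by two overlapping searches
pmids_from_searches = ['32663045', '32941086', '32663045', '32941086']

for pmid in pmids_from_searches:
    if pmid in seen:
        print('already scraped, skipping', pmid)
        continue
    seen.add(pmid)
    print('would scrape https://pubmed.ncbi.nlm.nih.gov/' + pmid)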

Current Output

Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086

Desired Output

Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086

Here's my code:

from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd

all_pmids = []
out = []

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:
    
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

    for pmid in all_pmids:
        url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
        response2 = requests.get(url)
        soup2 = BeautifulSoup(response2.content, 'html.parser')

        title = soup2.select('h1.heading-title')[0].text.strip()
        
        data = {'title': title, 'pmid': pmid, 'url':url}
        time.sleep(3)
        out.append(data)
df = pd.DataFrame(out)

df.to_excel('my_results.xlsx')


Upvotes: 1

Views: 2359

Answers (2)

Lambda

Reputation: 1392

You should move the `for pmid in all_pmids` loop outside the `for search_url in search_urls` loop.

...
for search_url in search_urls:

    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

## move this for loop outside!!
for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')

...

Upvotes: 2

goalie1998

Reputation: 1442

This is just an indentation error, or more accurately, an issue of where you are running your two for loops. If it was just an overlooked mistake, unindent your second for loop; if not, read the explanation below.

Because you are looping over `all_pmids` inside your larger `search_url` loop without resetting it after each search, the first pass finds the first two PMIDs, adds them to `all_pmids`, and then runs the inner loop over those two.

On the second pass of the outer loop, it finds the same two PMIDs again, sees they're already in `all_pmids` so doesn't add them, but still runs the inner loop over the two already stored in the list, scraping each article a second time.

You should run the inner loop separately, like this:

from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd

all_pmids = []
out = []

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:
    
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')

    title = soup2.select('h1.heading-title')[0].text.strip()
        
    data = {'title': title, 'pmid': pmid, 'url':url}
    time.sleep(3)
    out.append(data)
df = pd.DataFrame(out)

df.to_excel('my_results.xlsx')
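
As a side note beyond the fix itself: storing the seen PMIDs in a `set` makes the membership check O(1) and lets you replace the conditional-expression trick with a plain `if`. A minimal sketch of the collection step in that style (same imports and `search_urls` as above):

all_pmids = set()

for search_url in search_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # collect each PMID at most once
    for span in soup.find_all('span', {'class': 'docsum-pmid'}):
        pmid = span.get_text().strip()
        if pmid in all_pmids:
            print('already scraped, skipping', pmid)
        else:
            all_pmids.add(pmid)

One caveat: a set does not preserve insertion order, so if you care about the order in which articles are scraped, keep the list for iteration and use the set only for the membership test.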

Upvotes: 2
