tenebris silentio

Reputation: 519

Python - Beautiful Soup: Web scraping PubMed - extracting PMIDs (article IDs), adding them to a list, and preventing duplicate scraping

I want to extract research abstracts from PubMed. I will have multiple URLs to search for publications, and some of them will return the same articles as others. Each article has a unique ID called a PMID, and each abstract's URL is just a base URL plus the PMID (for example: https://pubmed.ncbi.nlm.nih.gov/ + 32663045). I don't want to extract the same article twice (e.g., it makes the whole script take longer to run and uses more bandwidth), so once I extract a PMID, I add it to a list. I'm trying to make my code extract the information from each abstract only once; however, my code is still extracting duplicate PMIDs and publication titles.

I know how to drop duplicates from my output in Pandas, but that's not what I want to do. I want to skip over PMIDs/URLs that I have already scraped.
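
In other words, the skip logic I'm after behaves roughly like this (just a sketch, with the two PMIDs from the output below hard-coded; `seen` and `pmids_from_searches` are placeholder names):

seen = set()
# duplicate PMIDs, as returned by two overlapping searches
pmids_from_searches = ['32663045', '32941086', '32663045', '32941086']

for pmid in pmids_from_searches:
    if pmid in seen:
        print('already scraped, skipping', pmid)
        continue
    seen.add(pmid)
    print('would scrape https://pubmed.ncbi.nlm.nih.gov/' + pmid)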

Current Output

Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086

Desired Output

Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086

Here's my code:

from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd

all_pmids = []
out = []

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:
    
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

    for pmid in all_pmids:
        url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
        response2 = requests.get(url)
        soup2 = BeautifulSoup(response2.content, 'html.parser')

        title = soup2.select('h1.heading-title')[0].text.strip()
        
        data = {'title': title, 'pmid': pmid, 'url':url}
        time.sleep(3)
        out.append(data)
df = pd.DataFrame(out)

df.to_excel('my_results.xlsx')


Upvotes: 1

Views: 2359

Answers (2)

Lambda

Reputation: 1392

You should move the `for pmid in all_pmids` loop outside the `for search_url in search_urls` loop.

...
for search_url in search_urls:

    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

## move this for loop outside!!
for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')

...

Upvotes: 2

goalie1998

Reputation: 1442

This is just an indentation error, or more accurately, an issue of where you are running your two for loops. If it was just an overlooked mistake, unindent your second for loop; if not, read the explanation below.

Because you are looping over `all_pmids` inside your larger `search_url` loop without resetting it after each search, the first pass finds the first two PMIDs, adds them to `all_pmids`, and then runs the inner loop over those two.

On the second pass of the outer loop, it finds the same two PMIDs again, sees they're already in `all_pmids` so doesn't add them, but still runs the inner loop over the two already stored in the list, scraping each article a second time.

You should run the inner loop separately, like this:

from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd

all_pmids = []
out = []

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:
    
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')

    title = soup2.select('h1.heading-title')[0].text.strip()
        
    data = {'title': title, 'pmid': pmid, 'url':url}
    time.sleep(3)
    out.append(data)
df = pd.DataFrame(out)

df.to_excel('my_results.xlsx')
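
As a side note beyond the fix itself: storing the seen PMIDs in a `set` makes the membership check O(1) and lets you replace the conditional-expression trick with a plain `if`. A minimal sketch of the collection step in that style (same imports and `search_urls` as above):

all_pmids = set()

for search_url in search_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # collect each PMID at most once
    for span in soup.find_all('span', {'class': 'docsum-pmid'}):
        pmid = span.get_text().strip()
        if pmid in all_pmids:
            print('already scraped, skipping', pmid)
        else:
            all_pmids.add(pmid)

One caveat: a set does not preserve insertion order, so if you care about the order in which articles are scraped, keep the list for iteration and use the set only for the membership test.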

Upvotes: 2
