Reputation: 519
I want to extract research abstracts from PubMed. I will have multiple URLs to search for publications, and some of them will return the same articles as others. Each article has a unique ID called a PMID. Basically, the URL of each abstract is a base URL plus the PMID (example: https://pubmed.ncbi.nlm.nih.gov/ + 32663045). However, I don't want to extract the same article twice, for multiple reasons (e.g., it takes longer to run the entire code and uses more bandwidth), so once I extract a PMID, I add it to a list. I'm trying to make my code extract the information from each abstract only once, but it is still producing duplicate PMIDs and publication titles.
I know how to get rid of duplicates in Pandas in my output, but that's not what I want to do. I basically want to skip over PMIDs/URLs that I've already scraped.
Current Output
Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
Desired Output
Title | PMID
COVID-19 And Racial/Ethnic Disparities In Health Risk | 32663045
The Risk Of Severe COVID-19 | 32941086
Here's my code:
from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd
all_pmids = []
out = []
search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']
for search_url in search_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')
    for pmid in all_pmids:
        url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
        response2 = requests.get(url)
        soup2 = BeautifulSoup(response2.content, 'html.parser')
        title = soup2.select('h1.heading-title')[0].text.strip()
        data = {'title': title, 'pmid': pmid, 'url':url}
        time.sleep(3)
        out.append(data)

df = pd.DataFrame(out)
df.to_excel('my_results.xlsx')
Upvotes: 1
Views: 2359
Reputation: 1392
You should move the `for pmid in all_pmids` loop outside the `for search_url in search_urls` loop:
...
for search_url in search_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

## move this for loop outside!!
for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')
...
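As a side note (my suggestion, not something the original code does): a set makes the membership check O(1) and discards duplicates on its own, so the collection step can drop the conditional entirely. A minimal sketch of just that step, reusing the search URL from the question:

from bs4 import BeautifulSoup
import requests

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']

all_pmids = set()   # a set silently ignores values it already contains

for search_url in search_urls:
    soup = BeautifulSoup(requests.get(search_url).content, 'html.parser')
    for span in soup.find_all('span', {'class': 'docsum-pmid'}):
        all_pmids.add(span.get_text())   # no membership check needed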
Upvotes: 2
Reputation: 1442
It's just an indentation error, or more accurately, a matter of where you are running your two for loops. If it's just an overlooked mistake, unindent your second for loop; if it isn't, read the explanation below.
Because you are running the loop over `all_pmids` inside your larger `search_url` loop without resetting the list after each search, the first pass finds two pmids, adds them to `all_pmids`, and then runs the inner loop over those two. On the second run of the outer loop, it finds the same two pmids, sees they're already in `all_pmids` so doesn't add them, but still runs the inner loop again over the two already stored in the list, which is why each article appears twice.
You should run the inner loop separately, as such:
from bs4 import BeautifulSoup
import csv
import time
import requests
import pandas as pd

all_pmids = []
out = []

search_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=','https://pubmed.ncbi.nlm.nih.gov/?term=%28AHRQ%5BAffiliation%5D%29+AND+%28COVID-19%5BText+Word%5D%29&sort=']

for search_url in search_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    pmids = soup.find_all('span', {'class' : 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print('project already in list, skipping')

# this loop now runs once, after every search has been collected
for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/'+pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')
    title = soup2.select('h1.heading-title')[0].text.strip()
    data = {'title': title, 'pmid': pmid, 'url':url}
    time.sleep(3)
    out.append(data)

df = pd.DataFrame(out)
df.to_excel('my_results.xlsx')
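Two further points worth knowing. First, `df.to_excel('my_results.xlsx')` needs the openpyxl package installed to write .xlsx files. Second, if you want the fetch loop to be more defensive (my addition, not part of the fix above): `soup2.select(...)[0]` raises an IndexError when a page lacks the expected heading, so a guard plus a shared session is safer. A sketch under those assumptions, continuing from `all_pmids` and `out` above:

import time

import requests
from bs4 import BeautifulSoup

session = requests.Session()   # reuses one TCP connection across requests

for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/' + pmid
    soup2 = BeautifulSoup(session.get(url).content, 'html.parser')
    heading = soup2.select_one('h1.heading-title')   # None if the page layout differs
    if heading is None:
        print('no title found for', pmid, '- skipping')
        continue
    out.append({'title': heading.text.strip(), 'pmid': pmid, 'url': url})
    time.sleep(3)   # stay polite to the PubMed servers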
Upvotes: 2