Reputation: 380
I'm working on my first web scraper, which grabs author names, URLs, and paper titles from PMC when given a "CitedBy" page like this
My program works fine for getting the author names and the URLs. However, I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.
Here's what I've got so far:
import requests
from bs4 import BeautifulSoup
import re

url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
req = requests.get(url)
plain_text = req.text
soup = BeautifulSoup(plain_text, "lxml")  # soup object

titles_list = []
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        title = ("UHOH")  # Problems with some titles
    # print(title)
    titles_list.append(title)
When I run this part of my code, my scraper gives me these results:
And so on for the whole page...
Some papers on this page that I get "UHOH" for are:
The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula
I believe the first two I've listed here are problematic because "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is in an "em" HTML tag, so I suspect that this is why my scraper cannot scrape it.
The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping me. I tried:
for items in soup.findAll('div', {'class': 'title'}):
    title = items.string
    if title is None:
        for other in soup.findAll('a', {'class': 'view'}):
            title = other.string
But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!
Upvotes: 1
Views: 642
Reputation: 380
Thanks to @LukasGraf, I have the answer!
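Since I'm using BeautifulSoup, I can use node.get_text(). It works differently from plain ".string": ".string" returns None when a tag contains more than one child node, whereas get_text() returns all the text beneath a tag, including text inside nested tags. That was exactly the case for the titles with subscripts and "em"-marked text.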
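For anyone who finds this later, here is a minimal sketch of the difference. The HTML snippet is made up to imitate the "title" divs (it is not copied from the real PMC page), and I use the built-in "html.parser" so it runs without lxml installed:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the "title" divs on a CitedBy page;
# the real page structure may differ.
html = (
    '<div class="title"><a class="view" href="#">A plain title</a></div>'
    '<div class="title"><a class="view" href="#">Genes in <em>Medicago truncatula</em></a></div>'
    '<div class="title"><a class="view" href="#">Sequencing of F<sub>1</sub> progeny</a></div>'
)

soup = BeautifulSoup(html, "html.parser")

string_results = []  # what .string gives back
text_results = []    # what get_text() gives back

for div in soup.find_all("div", {"class": "title"}):
    # .string is None as soon as the link holds more than one child node
    # (plain text plus an <em> or <sub> tag).
    string_results.append(div.string)
    # get_text() joins every piece of text below the tag, nested or not.
    text_results.append(div.get_text().strip())

print(string_results)
print(text_results)
```

In my original loop, that means replacing "title = items.string" with "title = items.get_text()", after which the "UHOH" fallback should no longer be needed for these titles.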
Upvotes: 1