Scraping text with subscripts with BeautifulSoup in Python

Question

all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper names from PMC, when given a "CitedBy" page like this

My program works fine for getting the author names and the URL's, however I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.

Here's what I've got so far:

    import requests
    from bs4 import BeautifulSoup
    import re

    url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
    req = requests.get(url)
    plain_text = req.text
    soup = BeautifulSoup(plain_text, "lxml") #soup object

    titles_list = []

    for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            title = ("UHOH") #Problems with some titles
        #print(title)
        titles_list.append(title)

When I run this part of my code, my scraper gives me these results:

Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
UHOH
Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
UHOH
Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants

And so on for the whole page...

Some papers on this page that I get "UHOH" for are:

Comparative cell-specific transcriptomics reveals differentiation of C4 photosynthesis pathways in switchgrass and other C4 lineages
The genome sequence of the outbreeding globe artichoke constructed de novo incorporating a phase-aware low-pass sequencing strategy of F1 progeny
Cross-Family Translational Genomics of Abiotic Stress-Responsive Genes between Arabidopsis and Medicago truncatula

The first two I've listed here I believe are problematic because of "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is in an "em" HTML tag, so I suspect that this is why my scraper cannot scrape it.

The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping me. I tried:

for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            for other in soup.findAll('a', {'class': 'view'}):
                title = other.string

But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!

SnarkShark · Accepted Answer

Thanks to @LukasGraf, I have the answer!

Since I'm using the BeautifulSoup, I can use node.get_text(). It works different from the plain, ".string" because it also returns all the text beneath a tag, which was the case for the subscripts and "em" HTML marked text.

Scraping text with subscripts with BeautifulSoup in Python

Answers (1)

Related Questions