SnarkShark
SnarkShark

Reputation: 380

Scraping text with subscripts with BeautifulSoup in Python

all! I'm working on my first web scraper ever, which grabs author names, URLs, and paper names from PMC, when given a "CitedBy" page like this

My program works fine for getting the author names and the URL's, however I can only get some of the paper titles, which I suspect is due to subscripts and superscripts.

Here's what I've got so far:

    import requests
    from bs4 import BeautifulSoup
    import re

    url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2593677/citedby/?page=0'
    req = requests.get(url)
    plain_text = req.text
    soup = BeautifulSoup(plain_text, "lxml") #soup object

    titles_list = []

    for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            title = ("UHOH") #Problems with some titles
        #print(title)
        titles_list.append(title)

When I run this part of my code, my scraper gives me these results:

  1. Finding and Comparing Syntenic Regions among Arabidopsis and the Outgroups Papaya, Poplar, and Grape: CoGe with Rosids
  2. UHOH
  3. Comprehensive Comparative Genomic and Transcriptomic Analyses of the Legume Genes Controlling the Nodulation Process
  4. UHOH
  5. Dosage Sensitivity of RPL9 and Concerted Evolution of Ribosomal Protein Genes in Plants

And so on for the whole page...

Some papers on this page that I get "UHOH" for are:

The first two I've listed here I believe are problematic because of "C4" and "F1" are actually "C subscript 4" and "F subscript 1". For the third one, "Medicago truncatula" is in an "em" HTML tag, so I suspect that this is why my scraper cannot scrape it.

The only alternative solution I've thought of is making my "soup.findAll" more specific, but that didn't end up helping me. I tried:

for items in soup.findAll('div', {'class': 'title'}):
        title = items.string
        if title is None:
            for other in soup.findAll('a', {'class': 'view'}):
                title = other.string

But sadly, this didn't work... So I'm not exactly sure how to approach this. Does anybody know how to handle special cases like these? Thank you so much!

Upvotes: 1

Views: 642

Answers (1)

SnarkShark
SnarkShark

Reputation: 380

Thanks to @LukasGraf, I have the answer!

Since I'm using the BeautifulSoup, I can use node.get_text(). It works different from the plain, ".string" because it also returns all the text beneath a tag, which was the case for the subscripts and "em" HTML marked text.

Upvotes: 1

Related Questions