BeautifulSoup: Extract the data in the text after a particular span

Question

I know there are loads of similar questions out there, but I can't figure out my specific case.

On this page, I want to extract the number '121,320' from the line: 'Mass (Da):121,320'

From the BeautifulSoup output, I can see that this is the part I want:

Show »Length:1,094Mass (Da):121,320

I was trying this:

import urllib
import requests
import sys
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
        data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
        soup = BeautifulSoup(data.content, 'html.parser')


        #prints all spans
        print(soup.find_all('span'))

        #prints empty list
        print(soup.find_all('span',title_='The mass of the unprocessed protein, in Daltons.'))

The closest I've gotten was by trying to follow this answer on SO:

    div1 = soup.find("div", { "class" : "sequence-isoform-rightcol" }).findAll('span', { "class" : "sequence-field-header tooltiped" })
    for x in div1:
            print(x.text)

The issue is that it prints out:

Length:
Mass (Da):

without the actual values.

How do I extract the mass from each page that I have? In this case 121,320?

KunduK · Accepted Answer

You can use a regular expression (the re module) to find the span whose text contains "Mass", and then call find_next('span') to get the span that follows it, which holds the value.

import re
import requests
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
    data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
    soup = BeautifulSoup(data.content, 'html.parser')
    # Find the label span by its text, then step to the next span, which holds the value.
    # string= is the current name of the older text= argument (bs4 4.4.0 and later).
    print(soup.find('span', string=re.compile("Mass")).find_next('span').text)

Output:

121,320
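
To show what find_next('span') does without hitting the network, here is a minimal sketch that runs against a small HTML snippet. The markup in SAMPLE_HTML is an assumption reconstructed from the text and class names in the question, not the real UniProt page, and value_after_label is a hypothetical helper name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup based on the question; the real page may differ.
SAMPLE_HTML = """
<div class="sequence-isoform-rightcol">
  <span class="sequence-field-header tooltiped">Length:</span><span>1,094</span>
  <span class="sequence-field-header tooltiped" title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span><span>121,320</span>
</div>
"""


def value_after_label(soup, label):
    """Return the text of the first <span> after the span whose text matches label."""
    label_span = soup.find('span', string=re.compile(re.escape(label)))
    if label_span is None:
        return None
    value_span = label_span.find_next('span')
    return value_span.get_text(strip=True) if value_span is not None else None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
print(value_after_label(soup, 'Mass'))
print(value_after_label(soup, 'Length'))

# Looking the label up by its title attribute also works, but the keyword
# must be title=, not title_=. Only class_ gets the trailing-underscore treatment.
by_title = soup.find('span', title='The mass of the unprocessed protein, in Daltons.')
print(by_title.find_next('span').get_text(strip=True))
```

This also explains the empty list in the question: title_= is treated as a filter on an attribute literally named "title_", which no span has. The helper returns None when the label is missing, so a page with a different layout does not raise an AttributeError.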

Or, if you have bs4 4.7 or later, you can use the following CSS selector.

import requests
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
    data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
    soup = BeautifulSoup(data.content, 'html.parser')
    # Select the span containing the label text, then step to the next span for the value.
    print(soup.select_one('span:contains("Mass (Da)")').find_next('span').text)

Output:

121,320
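
Newer releases of Soup Sieve (the selector library behind select_one, version 2.1 and later) deprecate :contains in favour of :-soup-contains. The sketch below uses the newer spelling and also converts the result to an integer. As before, SAMPLE_HTML is hypothetical markup assumed from the question rather than the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup based on the question; the real page may differ.
SAMPLE_HTML = """
<div class="sequence-isoform-rightcol">
  <span class="sequence-field-header tooltiped">Length:</span><span>1,094</span>
  <span class="sequence-field-header tooltiped">Mass (Da):</span><span>121,320</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# :-soup-contains matches elements whose text contains the given string.
label = soup.select_one('span:-soup-contains("Mass (Da)")')

# Guard against a missing label instead of calling find_next on None.
mass_text = label.find_next('span').get_text(strip=True) if label else None

# Strip the thousands separator before converting to a number.
mass_value = int(mass_text.replace(',', '')) if mass_text else None

print(mass_text)
print(mass_value)
```

Keeping the string form and the numeric form separate makes it easy to print the value as shown on the page while still doing arithmetic on it.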
