Reputation: 1511
I know there are loads of similar questions to this out there, but I can't figure out my specific case.
On this page, I want to extract the number '121,320' from the line 'Mass (Da):121,320'.
I can see from the BeautifulSoup output that this is the part I want:
</div><a class="show-link" href="#" id="O00203-show-link" style="display:none">Show »</a></div><div class="sequence-isoform-rightcol"><div><span class="sequence-field-header tooltiped" title="Sequence length.">Length:</span><span>1,094</span></div><div><span class="sequence-field-header tooltiped" title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span><span>121,320</span>
I was trying this:
import urllib
import requests
import sys
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
    data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
    soup = BeautifulSoup(data.content, 'html.parser')
    # prints all spans
    print(soup.find_all('span'))
    # prints empty list
    print(soup.find_all('span', title_='The mass of the unprocessed protein, in Daltons.'))
The closest I've gotten was by trying to follow this answer on SO:
div1 = soup.find("div", { "class" : "sequence-isoform-rightcol" }).findAll('span', { "class" : "sequence-field-header tooltiped" })
for x in div1:
    print(x.text)
The issue is that it prints out:
Length:
Mass (Da):
without the actual values.
How do I extract the mass from each page that I have? In this case 121,320?
Upvotes: 0
Views: 1043
Reputation: 33384
You can use a regular expression (the re module) to find the label span by its text, and then use find_next('span') to get the value that follows it:
import re
import requests
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
    data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
    soup = BeautifulSoup(data.content, 'html.parser')
    # find the label span whose text matches "Mass", then take the next span
    # (newer bs4 releases call the text= argument string=)
    print(soup.find('span', text=re.compile("Mass")).find_next('span').text)
Output:
121,320
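If a page has no mass line, find() returns None and the chained call raises an AttributeError. Here is a minimal sketch of the same lookup wrapped in a function that returns None instead. The name extract_mass and the SAMPLE_HTML variable are my own, made up for illustration. The HTML is just the fragment from the question, so this runs without a network request.

```python
import re
from bs4 import BeautifulSoup

# Fragment copied from the question; for a live page you would pass
# requests.get(...).content instead (SAMPLE_HTML is an illustrative name).
SAMPLE_HTML = (
    '<div class="sequence-isoform-rightcol">'
    '<div><span class="sequence-field-header tooltiped" title="Sequence length.">Length:</span>'
    '<span>1,094</span></div>'
    '<div><span class="sequence-field-header tooltiped" '
    'title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span>'
    '<span>121,320</span></div></div>'
)


def extract_mass(html):
    """Return the text of the span after the 'Mass' label, or None if absent.

    extract_mass is a hypothetical helper, not part of bs4.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # string= is the newer name for the text= argument used above
    label = soup.find('span', string=re.compile('Mass'))
    if label is None:
        return None
    value = label.find_next('span')
    return value.text.strip() if value is not None else None


print(extract_mass(SAMPLE_HTML))
```

In the loop over uniprot_list you would call extract_mass(data.content) and skip the IDs for which it returns None.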
Or, if you have BeautifulSoup 4.7 or above, you can use the following CSS selector:
import requests
from bs4 import BeautifulSoup

uniprot_list = ['O00203']
for each_id in uniprot_list:
    data = requests.get('https://www.uniprot.org/uniprot/' + each_id + '#sequences.html')
    soup = BeautifulSoup(data.content, 'html.parser')
    # select the span containing the label text, then take the next span
    # (newer soupsieve releases spell this pseudo-class :-soup-contains)
    print(soup.select_one('span:contains("Mass (Da)")').find_next('span').text)
Output:
121,320
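As for why the attempt in the question printed an empty list: as far as I can tell it is the keyword title_. bs4 only needs the trailing underscore for class_, because class is a reserved word in Python. Any other keyword is taken as an attribute name as written, so title_= filters on an attribute literally called title_. A sketch with title= instead, run against the fragment from the question (SAMPLE_HTML and MASS_TITLE are names I made up):

```python
from bs4 import BeautifulSoup

# Fragment copied from the question (SAMPLE_HTML is an illustrative name).
SAMPLE_HTML = (
    '<div class="sequence-isoform-rightcol">'
    '<div><span class="sequence-field-header tooltiped" title="Sequence length.">Length:</span>'
    '<span>1,094</span></div>'
    '<div><span class="sequence-field-header tooltiped" '
    'title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span>'
    '<span>121,320</span></div></div>'
)
MASS_TITLE = 'The mass of the unprocessed protein, in Daltons.'

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# title_= looks for an attribute named "title_", which no span has
wrong = soup.find_all('span', title_=MASS_TITLE)

# title= matches the label span; the value is its next sibling span
label = soup.find('span', title=MASS_TITLE)
mass = label.find_next_sibling('span').text

# The same idea as a CSS selector: a span directly after the titled span
mass_css = soup.select_one('span[title^="The mass"] + span').text

print(wrong, mass, mass_css)
```

Matching on the title attribute depends on the tooltip wording staying the same, so the text-based lookups above may be the safer choice if the site changes it.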
Upvotes: 1
Reputation: 2868
from bs4 import BeautifulSoup

# HTML fragment copied from the question
data = '''
<html>
<body>
</div><a class="show-link" href="#" id="O00203-show-link" style="display:none">Show »</a></div>
<div class="sequence-isoform-rightcol">
<div><span class="sequence-field-header tooltiped" title="Sequence length.">Length:</span><span>1,094</span></div>
<div><span class="sequence-field-header tooltiped" title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span><span>121,320</span>
</body>
</html>
'''
soup = BeautifulSoup(data, 'lxml')
# collect the text of every span, labels and values alike
span_text = [x.text for x in soup.findAll('span')]
print(span_text)
# Output:
# ['Length:', '1,094', 'Mass (Da):', '121,320']
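The flat list still mixes labels and values. If the spans strictly alternate label, value (which holds for this fragment, but is an assumption for the full page), you can pair them into a dict and convert the mass to a number. A rough sketch, using html.parser so lxml isn't needed (SAMPLE_HTML and fields are illustrative names):

```python
from bs4 import BeautifulSoup

# Fragment copied from the question (SAMPLE_HTML is an illustrative name).
SAMPLE_HTML = (
    '<div class="sequence-isoform-rightcol">'
    '<div><span class="sequence-field-header tooltiped" title="Sequence length.">Length:</span>'
    '<span>1,094</span></div>'
    '<div><span class="sequence-field-header tooltiped" '
    'title="The mass of the unprocessed protein, in Daltons.">Mass (Da):</span>'
    '<span>121,320</span></div></div>'
)

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
span_text = [x.text for x in soup.find_all('span')]

# Assumption: spans alternate label, value, label, value, ...
fields = dict(zip(span_text[0::2], span_text[1::2]))

# Strip the thousands separator before converting to int
mass = int(fields['Mass (Da):'].replace(',', ''))

print(fields)
print(mass)
```

On the real page there are many more spans, so you would probably want to restrict find_all to the sequence-isoform-rightcol div first.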
Upvotes: 1