Reputation: 131
Let's say I have the following website:
https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod
When you open this page, it displays a lot of information. In my case, I just want the temperature from the Culture Conditions section.
When you scroll down the page, you will see a section called "Culture Conditions":
Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C
Using the requests library, I'm able to get the HTML of the page. When I save the HTML and search through it, my data appears towards the bottom
in this form:
Culture Conditions
</th>
<td>
<div><strong>Atmosphere: </strong>air, 95%; carbon dioxide (CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37°C</div>
I'm not sure what to do after this. I looked into using BeautifulSoup to parse the HTML, but I was not successful.
This is all the code that I have so far:
import requests
url='https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'
page = requests.get(url)
textPage = page.text  # page.text is already a str
# Save the HTML so it can be inspected offline
with open('test2', 'w', encoding='utf-8') as file:
    file.write(textPage)
Upvotes: 1
Views: 231
Reputation: 22440
Another approach you may find useful:
import requests
from bs4 import BeautifulSoup
url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'
page = requests.get(url)
soup = BeautifulSoup(page.text,"lxml")
for items in soup.find_all("strong"):
    if "Atmosphere:" in items.text:
        atmos = items.find_parent().text
        temp = items.find_parent().find_next_sibling().text
        print(f'{atmos}\n{temp}')
Output:
Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C
Upvotes: 0
Reputation: 844
import requests
from bs4 import BeautifulSoup
url = 'https://www.atcc.org/Products/All/CRL-2528.aspx#culturemethod'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
cc = soup.select('#layoutcontent_2_middlecontent_0_productdetailcontent_0_maincontent_2_rptTabContent_rptFields_2_fieldRow_3 td div')
for c in cc:
    print(c.text.strip())
Output:
Atmosphere: air, 95%; carbon dioxide (CO2), 5%
Temperature: 37°C
To just get the temperature:
cc = soup.select('#layoutcontent_2_middlecontent_0_productdetailcontent_0_maincontent_2_rptTabContent_rptFields_2_fieldRow_3 td div')[-1]
cc = cc.text.split(':')[-1].strip()
print(cc)
Output:
37°C
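The long `id` in the selector above looks auto-generated, so it may change between pages or site updates. As a hedged alternative, the sketch below looks the value up by its visible label instead. It runs against the HTML fragment quoted in the question rather than the live site, and `SAMPLE_HTML` and `extract_field` are hypothetical names introduced here for illustration; the surrounding `<table>` markup is an assumption.

```python
from bs4 import BeautifulSoup

# HTML fragment based on the question; the <table>/<tr> wrapper is assumed
# and the live page may be structured differently.
SAMPLE_HTML = """
<table><tr>
<th>Culture Conditions</th>
<td>
<div><strong>Atmosphere: </strong>air, 95%; carbon dioxide (CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37°C</div>
</td>
</tr></table>
"""


def extract_field(html, label):
    """Return the text that follows a <strong> label such as 'Temperature:', or None."""
    soup = BeautifulSoup(html, "html.parser")
    for strong in soup.find_all("strong"):
        # Compare the label with surrounding whitespace removed
        if strong.get_text(strip=True) == label:
            # The parent <div> holds both the label and its value
            parent_text = strong.parent.get_text()
            return parent_text.split(":", 1)[1].strip()
    return None


temperature = extract_field(SAMPLE_HTML, "Temperature:")
print(temperature)
```

To use it on the real page, pass `r.text` from the request above in place of `SAMPLE_HTML`. If the label is not present, the function returns `None` instead of raising.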
Upvotes: 2
Reputation: 4379
I wrote a regular expression that searches for the line starting with <div><strong>Atmosphere:
and captures everything up to the end of the line. Then I removed every unwanted string from the result. Et voilà!
import re
textPage = re.search(r"<div><strong>Atmosphere: .*", textPage).group(0)
wrongString = ['<div>','</div>','<strong>','</strong>','<sub>','</sub>']
for ws in wrongString:
    textPage = re.sub(ws, "", textPage)
with open('test2', 'w', encoding='utf-8') as file:
    file.write(textPage)
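If only the temperature is needed, a capture group can pull it out directly instead of stripping tags afterwards. This is a minimal sketch that assumes the markup stays exactly as quoted in the question (`<strong>Temperature: </strong>` followed by the value); `SAMPLE_HTML`, `TEMPERATURE_RE`, and `find_temperature` are hypothetical names used for illustration.

```python
import re

# Line copied from the HTML quoted in the question
SAMPLE_HTML = (
    '<div><strong>Atmosphere: </strong>air, 95%; carbon dioxide '
    '(CO<sub>2</sub>), 5%</div><div><strong>Temperature: </strong>37°C</div>'
)

# Capture everything after the label up to the next tag
TEMPERATURE_RE = re.compile(r"<strong>Temperature:\s*</strong>\s*([^<]+)")


def find_temperature(html):
    """Return the temperature string, or None if the pattern is not found."""
    match = TEMPERATURE_RE.search(html)
    return match.group(1).strip() if match else None


result = find_temperature(SAMPLE_HTML)
print(result)
```

Regular expressions are brittle against HTML changes, so this only suits a page whose markup is known and stable.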
Upvotes: 1