Reputation: 674
The following code extracts data from a specific table on a webpage:
import requests
from bs4 import BeautifulSoup
url="XYZ"
sector_response=requests.get(url)
soup=BeautifulSoup(sector_response.content,'lxml')
#Find the desired table
table=soup.find('table',attrs={'class': 'snapshot-data-tbl'})
headings = [th.get_text() for th in table.find("tr").find_all("th")]
for row in table.find_all("tr"):
    dataset = list(zip(headings, (td.get_text() for td in row.find_all("td"))))
    #Exclude the 'Weighting Recommendations' tuple
    new_dataset=[i for i in dataset if i[0]!='Weighting Recommendations']
    for item in new_dataset:
        print(item)
However, each cell in the body of the table contains a span with the class "timestamp" that I don't need. How can I exclude these spans?
For example:
<td>
<span class="negative">-0.39%</span>
<span class="timestamp"><time>04:20 PM ET 09/28/2018</time></span>
</td>
Current output:
('Last % Change', '\n-0.39%\n04:20 PM ET 09/28/2018\n')
Desired output:
('Last % Change', -0.39)
Upvotes: 1
Views: 432
Reputation: 2308
If the class name of the target span is always “negative”, you could do the following:
for row in table.find_all("tr"):
    dataset = list(zip(headings, (td.find('span', {"class": "negative"}).get_text() for td in row.find_all("td"))))
If it’s not always “negative”, you could take the first span in each cell instead:
for row in table.find_all("tr"):
    dataset = list(zip(headings, (td.find('span').get_text() for td in row.find_all("td"))))
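Also, to let your program run smoothly, handle the cases where an element is missing. For example, what if the table or a span can’t be found? find() returns None when nothing matches, so calling get_text() on the result raises an AttributeError.
As written, the program will just crash.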
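A possible alternative, sketched here under a few assumptions, is to remove the timestamp spans from each cell before reading its text, so that it does not matter which class the value span carries. The helper name cell_value and the inline sample HTML are hypothetical and only there to make the example self-contained; the sample mirrors the cell shown in the question. The sketch uses the standard-library "html.parser" backend instead of lxml, and it converts values such as "-0.39%" to a float to match the desired output, falling back to the plain text when the cell is not numeric.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the cell shown in the question.
html = """
<table class="snapshot-data-tbl">
  <tr><th>Last % Change</th></tr>
  <tr><td>
    <span class="negative">-0.39%</span>
    <span class="timestamp"><time>04:20 PM ET 09/28/2018</time></span>
  </td></tr>
</table>
"""

def cell_value(td):
    """Return the cell's value without its timestamp (hypothetical helper)."""
    # Remove every timestamp span from the cell. Note: this modifies the tree.
    for span in td.find_all("span", class_="timestamp"):
        span.decompose()
    text = td.get_text(strip=True)
    try:
        # Turn strings like "-0.39%" into the float -0.39.
        return float(text.rstrip("%").replace(",", ""))
    except ValueError:
        # Not numeric: keep the cleaned text as-is.
        return text

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", attrs={"class": "snapshot-data-tbl"})

rows = []
if table is not None:  # find() returns None when the table is missing
    headings = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if not cells:  # skip the header row, which has no <td> cells
            continue
        rows.append(list(zip(headings, (cell_value(td) for td in cells))))

for dataset in rows:
    for item in dataset:
        print(item)
```

Because decompose() alters the parsed document, re-parse the page if the timestamps are needed later. The check for a missing table avoids the AttributeError that find() would otherwise lead to.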
Upvotes: 1