Reputation: 191
I'm having problems parsing table data with BeautifulSoup, though I've tried many solutions found here, here, and here. I hate to re-ask but maybe my issue is unique and that is why the above solutions haven't worked, or I'm just an idiot.
So I'm trying to retrieve the flood triggers for any given river from water.weather.gov. I'm using the Mississippi river data because it has the most active measuring stations. Each station has 4 stage triggers that I am trying to obtain: Action, Flood, Moderate, and Major. I have been able to extract the table data for those categories when there are numerical values. However, when a station's table data is "Not Available", the row is skipped, so when I sort the values into stages they no longer line up with the right station.
The table data that I'm trying to extract looks like this:
<div class="box_square"> <b><b>Flood Categories (in feet)</b><br>
</b>
<table width="150" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr><td nowrap="">Not Available</td></tr>
</tbody>
<div class="box_square"> <b><b>Flood Categories (in feet)</b><br>
</b>
<table width="150" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr style="display:'';line-height:20px;background-color:#CC33FF;color:black">
<td scope="col" nowrap="">Major Flood Stage:</td>
<td scope="col">18</td>
</tr>
<tr style="display:'';line-height:20px;background-color:#FF0000;color:white">
<td scope="col" nowrap="">Moderate Flood Stage:</td>
<td scope="col">15</td>
</tr>
<tr style="display:'';line-height:20px;background-color:#FF9900;color:black">
<td scope="col" nowrap="">Flood Stage:</td>
<td scope="col">13</td>
</tr>
<tr style="display:'';line-height:20px;background-color:#FFFF00;color:black">
<td scope="col" nowrap="">Action Stage:</td>
<td scope="col">12</td>
</tr>
<tr style="display:none;line-height:20px;background-color:#906320;color:white">
<td scope="col" nowrap="">Low Stage (in feet):</td>
<td scope="col">-9999</td>
</tr>
</tbody>
</table><br></div>
The last Low Stage row isn't necessary, and I have filtered it out. Here is the code I have that populates alert_list with the appropriate values, but without the necessary Not Available:
alert_list = []
alert_values = []
alerts = soup.findAll('td', attrs={'scope':'col'})
for alert in alerts:
    alert_list.append(alert.text.strip())
a_values = alert_list[1::2]
alert_list.clear()
major_lvl = a_values[::5]
moderate_lvl = a_values[1::5]
flood_lvl = a_values[2::5]
action_lvl = a_values[3::5]
and the results:
>>> major_lvl
['18', '26', '0', '11', '0', '17', '17', '18', '0', '683', '16', '0', '20', '16', '18', '665', '661', '18', '651', '645', '15.5', '636', '20', '631', '22', '21', '20.5', '21.5', '20', '20', '20.5', '13.5', '18', '18', '20', '18.5', '17', '14', '18', '19', '25', '25', '25', '26', '25', '24', '22', '25', '33', '34', '29', '34', '40', '40', '0', '0', '0', '42', '42', '0', '0', '0', '0', '0', '44', '47', '43', '35', '46', '52', '55', '0', '44', '57', '50', '57', '64', '40', '34', '26', '20']
I just noticed actually that the reason the Not Available tag isn't getting scraped is because it's under the tr tag, not td. How do I add this so that my values line up?
Upvotes: 0
Views: 1844
Reputation: 61225
If you are only interested in those columns where scope=col, you can use a CSS selector to do this beautifully.
In [23]: from bs4 import BeautifulSoup as BS
In [24]: soup = BS(html, "html.parser")
In [25]: major_list = [td.get_text(strip=True) for td in soup.select("tr > td:nth-of-type(2)[scope=col]")[:-1]]
In [26]: major_list
Out[26]: ['18', '15', '13', '12']
To get all the rows alongside their columns, you need to select the rows first and, for each row, retrieve the data in its columns.
for tr in soup.select("div[class=box_square] tr"):
    print([td.get_text(strip=True) for td in tr.find_all("td")])
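To keep the values lined up per station, one option is to process each table on its own and fill in a placeholder when a stage is missing. The following is only a sketch, not tested against the live page: it assumes every station's table sits in its own div with class box_square, the sample HTML is a trimmed copy of the markup in the question, and the names STAGES and station_triggers are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed sample based on the markup in the question (two stations).
html = """
<div class="box_square"><b>Flood Categories (in feet)</b>
<table><tbody>
<tr><td nowrap="">Not Available</td></tr>
</tbody></table></div>
<div class="box_square"><b>Flood Categories (in feet)</b>
<table><tbody>
<tr><td scope="col">Major Flood Stage:</td><td scope="col">18</td></tr>
<tr><td scope="col">Moderate Flood Stage:</td><td scope="col">15</td></tr>
<tr><td scope="col">Flood Stage:</td><td scope="col">13</td></tr>
<tr><td scope="col">Action Stage:</td><td scope="col">12</td></tr>
<tr><td scope="col">Low Stage (in feet):</td><td scope="col">-9999</td></tr>
</tbody></table></div>
"""

# The four stages asked about in the question; Low Stage is left out on purpose.
STAGES = ("Major Flood Stage", "Moderate Flood Stage", "Flood Stage", "Action Stage")


def station_triggers(box):
    """Return {stage: value} for one station; missing stages stay None."""
    triggers = dict.fromkeys(STAGES)  # every stage starts as None
    for tr in box.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 2:  # a "Not Available" row has a single cell
            label = cells[0].rstrip(":")
            if label in triggers:
                triggers[label] = cells[1]
    return triggers


soup = BeautifulSoup(html, "html.parser")
stations = [station_triggers(box) for box in soup.find_all("div", class_="box_square")]

# One entry per station, so the lists stay aligned with the station order.
major_lvl = [s["Major Flood Stage"] for s in stations]
action_lvl = [s["Action Stage"] for s in stations]
print(major_lvl)
```

Because each station always contributes exactly one entry, a "Not Available" station shows up as None instead of shifting every later value by one position.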
Upvotes: 1
Reputation: 21643
You can also do it with a function. In your case, only the rows that you want have the style attribute. You can spin through all of the tags and accept only those that are tr and that have style.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('weather.htm'), 'lxml')
>>> def acceptable(tag):
...     return tag.name=='tr' and tag.has_attr('style')
...
>>> for tag in soup.find_all(acceptable):
...     tag.text.replace('\n', '').split(':')
...
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']
Edit, in response to a comment:
Omit acceptable and use this.
>>> for tag in soup.find_all('tr'):
...     if tag.has_attr('style'):
...         tag.text.replace('\n', '').split(':')
...     elif 'not available' in tag.text.lower():
...         tag.text
...     else:
...         pass
...
'Not Available'
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']
Upvotes: 1