Parsing multiple tables with BeautifulSoup

Question

I'm having problems parsing table data with BeautifulSoup, even after trying several solutions from similar questions here. I hate to re-ask, but maybe my issue is unique and that is why those solutions haven't worked, or I'm just an idiot.

I'm trying to retrieve the flood triggers for any given river from water.weather.gov. I'm using the Mississippi River data because it has the most active measuring stations. Each station has four stage triggers that I want: Action, Flood, Moderate, and Major. I can extract the table data for those categories when they hold numerical values. However, when a station's table data is "Not Available", that row is skipped, so the values I collect no longer line up with the stations they belong to.

The table data that I'm trying to extract looks like this:

        Flood Categories (in feet)

            Not Available

            Major Flood Stage:       18
            Moderate Flood Stage:    15
            Flood Stage:             13
            Action Stage:            12
            Low Stage (in feet):     -9999

The last Low Stage isn't necessary and I have filtered it out. Here is the code that I have that will populate alert_list with the appropriate values, but without the necessary Not Available:

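# soup is assumed to be a BeautifulSoup object already built from the river page
alert_list = []
# Each stage label and each stage value sits in its own <td scope="col"> cell
alerts = soup.find_all('td', attrs={'scope': 'col'})
for alert in alerts:
    alert_list.append(alert.text.strip())

# Cells alternate label, value; keep only the values
a_values = alert_list[1::2]
alert_list.clear()
# Five values per station: Major, Moderate, Flood, Action, Low
major_lvl = a_values[::5]
moderate_lvl = a_values[1::5]
flood_lvl = a_values[2::5]
action_lvl = a_values[3::5]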

and the results:

>>> major_lvl
['18', '26', '0', '11', '0', '17', '17', '18', '0', '683', '16', '0', '20', '16', '18', '665', '661', '18', '651', '645', '15.5', '636', '20', '631', '22', '21', '20.5', '21.5', '20', '20', '20.5', '13.5', '18', '18', '20', '18.5', '17', '14', '18', '19', '25', '25', '25', '26', '25', '24', '22', '25', '33', '34', '29', '34', '40', '40', '0', '0', '0', '42', '42', '0', '0', '0', '0', '0', '44', '47', '43', '35', '46', '52', '55', '0', '44', '57', '50', '57', '64', '40', '34', '26', '20']

I just noticed actually that the reason the Not Available tag isn't getting scraped is because it's under the tr tag, not td. How do I add this so that my values line up?

Bill Bell · Accepted Answer

You can also do it with a function. In your case, only the rows that you want have the style attribute. You can spin through all of the tags and accept only those that are tr and that have style.

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('weather.htm'), 'lxml')
>>> def acceptable(tag):
...     return tag.name=='tr' and tag.has_attr('style')
... 
>>> for tag in soup.find_all(acceptable):
...     tag.text.replace('\n', '').split(':')
...     
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']
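
The same row filter can also be written as a CSS attribute selector instead of a function. This is only a sketch: the HTML fragment below is a made-up stand-in for weather.htm, and select('tr[style]') is an alternative technique, not what the code above uses.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for weather.htm; the real markup may differ.
html = """
<table>
  <tr><td>Not Available</td></tr>
  <tr style="text-align:left"><td>Major Flood Stage:</td><td>18</td></tr>
  <tr style="text-align:left"><td>Action Stage:</td><td>12</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# tr[style] matches every <tr> that has a style attribute, whatever its value,
# so the "Not Available" row (no style attribute) is left out.
rows = [
    [cell.get_text(strip=True).rstrip(":") for cell in tr.find_all("td")]
    for tr in soup.select("tr[style]")
]
print(rows)
```

Reading the cells one by one also avoids having to strip newlines and split on the colon.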

Edit, in response to a comment:

Omit acceptable and use this.

>>> for tag in soup.find_all('tr'):
...     if tag.has_attr('style'):
...         tag.text.replace('\n', '').split(':')
...     elif 'not available' in tag.text.lower():
...         tag.text
...     else:
...         pass
...     
'Not Available'
['Major Flood Stage', '18']
['Moderate Flood Stage', '15']
['Flood Stage', '13']
['Action Stage', '12']
['Low Stage (in feet)', '-9999']
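
If the goal is lists that stay aligned with the stations, one option is to process each station's table separately and record None where no values exist. This is a minimal sketch under assumptions: the sample markup, the class name station, and the helper stages_for_table are all hypothetical, since the real page structure isn't shown here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one table per station is an assumption, modelled on the
# rows described above (value rows carry a style attribute, and a station
# without data has a single "Not Available" row).
SAMPLE_HTML = """
<table class="station">
  <tr><th>Flood Categories (in feet)</th></tr>
  <tr><td>Not Available</td></tr>
</table>
<table class="station">
  <tr><th>Flood Categories (in feet)</th></tr>
  <tr style="text-align:left"><td scope="col">Major Flood Stage:</td><td scope="col">18</td></tr>
  <tr style="text-align:left"><td scope="col">Moderate Flood Stage:</td><td scope="col">15</td></tr>
  <tr style="text-align:left"><td scope="col">Flood Stage:</td><td scope="col">13</td></tr>
  <tr style="text-align:left"><td scope="col">Action Stage:</td><td scope="col">12</td></tr>
  <tr style="text-align:left"><td scope="col">Low Stage (in feet):</td><td scope="col">-9999</td></tr>
</table>
"""

STAGES = ("Major Flood Stage", "Moderate Flood Stage", "Flood Stage", "Action Stage")


def stages_for_table(table):
    """Return {stage: value or None} for one station's table (hypothetical helper)."""
    levels = dict.fromkeys(STAGES)  # every stage starts as None
    for row in table.find_all("tr"):
        if not row.has_attr("style"):
            continue  # header row or the "Not Available" row
        label, _, value = row.get_text(" ", strip=True).partition(":")
        if label.strip() in levels:  # ignores "Low Stage (in feet)"
            levels[label.strip()] = value.strip()
    return levels


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
stations = [stages_for_table(t) for t in soup.find_all("table", class_="station")]

# One entry per station, in page order; None marks a station with no data.
major_lvl = [s["Major Flood Stage"] for s in stations]
action_lvl = [s["Action Stage"] for s in stations]
print(major_lvl)
print(action_lvl)
```

Because every station contributes exactly one entry to each list, a missing value shows up as None instead of shifting the later entries.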
