Parsing HTML with BeautifulSoup in Python

Question

I am trying to parse HTML with Python using BeautifulSoup, but I can't manage to get what I need.

This is a small module of a personal app I'm building. It logs in to a website with credentials, and once the script is logged in, I need to parse some information from the page so I can process it.

The HTML returned after logging in is:



        Account Balance
                    Daily Earnings
                        150
                    Weekly Earnings
                        500
                    Monthly Earnings
                        1500
                    Total expended
                        430
                    Account Balance
                        840

As you can see, the HTML is not very well formatted, but I need to extract the labels and their values, for example "Daily Earnings" and "150", "Weekly Earnings" and "500", and so on.

I think that the "id" attribute may help, but when I try to parse it, it crashes.

The Python code I'm working with is:

def parseo(archivohtml):
    html = archivohtml
    parsed_html = BeautifulSoup(html)
    par = parsed_html.find('td', attrs={'id':'west1'}).string
    print par

where archivohtml is the HTML file saved after logging in to the website.

When I run the script, I only get errors.

I've also tried doing this:

def parseo(archivohtml):
    soup = BeautifulSoup()
    html = archivohtml
    parsed_html = soup(html)
    par = soup.parsed_html.find('td', attrs={'id':'west1'}).string
    print par

But the result is still the same.

unutbu · Accepted Answer

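The tag with id="west1" is an <a> tag, not a <td>, so find('td', attrs={'id':'west1'}) returns None and calling .string on it raises an error. You are looking for the <td> tag that comes after this <a> tag: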

import BeautifulSoup as bs

content = '''
        Account Balance
    
    
        
            
                
                    Daily Earnings
                    
                        150                         
                    
                
                
                    Weekly Earnings
                    
                        500                     
                
                
                    Monthly Earnings
                    
                        1500                        
                
                
                    Total expended
                    
                        430                     
                
                
                    Account Balance
                    
                        840                     
                
                
                    
                    
                        
                            
                                
                                
                                
                            
                        
                    
                
            
        
    
'''

def parseo(archivohtml):
    html = archivohtml
    parsed_html = bs.BeautifulSoup(html)
    par = parsed_html.find('a', attrs={'id':'west1'}).findNext('td')        
    print par.string.strip()

parseo(content)

yields
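
For anyone on Python 3 with the bs4 package (BeautifulSoup 4), the same idea might look roughly like the sketch below. The sample markup is an assumption, since the real page's tags aren't visible here: it supposes each label sits in an <a> with an id such as west1 and the value is in the next <td>. The helper names value_after_anchor and all_pairs, and the id west2, are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: tag names, nesting, and the id "west2" are assumptions,
# not the real page. Only id="west1" on an <a> tag comes from the answer above.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a id="west1">Daily Earnings</a></td>
    <td> 150 </td>
  </tr>
  <tr>
    <td><a id="west2">Weekly Earnings</a></td>
    <td> 500 </td>
  </tr>
</table>
"""


def value_after_anchor(html, anchor_id):
    """Return the stripped text of the first <td> after <a id=anchor_id>, or None."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find("a", id=anchor_id)
    if anchor is None:
        # find() returns None when nothing matches; check before chaining calls.
        return None
    cell = anchor.find_next("td")
    return cell.get_text(strip=True) if cell is not None else None


def all_pairs(html):
    """Map each id-bearing <a> label to the text of the <td> that follows it."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = {}
    for anchor in soup.find_all("a", id=True):
        cell = anchor.find_next("td")
        if cell is not None:
            pairs[anchor.get_text(strip=True)] = cell.get_text(strip=True)
    return pairs


print(value_after_anchor(SAMPLE_HTML, "west1"))
print(all_pairs(SAMPLE_HTML))
```

Checking for None before calling find_next avoids the kind of crash described in the question, where a lookup that matches nothing is immediately dereferenced.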
