Python: parse html and produce a tabular text file

Question

The problem: I want to parse an HTML page and produce a tabular text file such as this:

East Counties
Babergh, http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml, 876
Basildon, http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml, 1134
...
...

What I get instead: Only East Counties appears in the txt file, so the for loop fails to print each new region. My attempt is shown after the HTML code.

HTML code: The code can be found in this HTML page; this is the excerpt corresponding to the table above:


                                    East Counties

                                        My attempt:
from xml.dom import minidom
import urllib2
from bs4 import BeautifulSoup

url='http://ratings.food.gov.uk/open-data/'
f = urllib2.urlopen(url)
mainpage = f.read()
soup = BeautifulSoup(mainpage, 'html.parser')

regions=[]
with open('Regions_and_files.txt', 'w') as f:
    for h2 in soup.find_all('h2')[6:]: #Skip 6 h2 lines 
        region=h2.text.strip() #Get the text of each h2 without the white spaces
        regions.append(str(region))
        f.write(region+'\n')
        for tr in soup.find_all('tr')[1:]: # Skip headers
            tds = tr.find_all('td')
            if len(tds)==0:
                continue
            else:
                a = tr.find_all('a')
                link = str(a)[10:67]
                span = tr.find_all('span')
                places = int(str(span[3].text).replace(',', ''))
                f.write("%s,%s,%s" % \
                              (str(tds[0].text)[1:-1], link, places)+'\n')
How can I fix this?

                                            
                                                
                                                    
                                                        Local authority
                                                    
                                                    
                                                        Last update
                                                    
                                                    
                                                        Number of businesses
                                                    
                                                    
                                                        Download
                                                    
                                                
                                            

                                        
                                            
                                                Babergh
                                            
                                            
                                                04/05/2017 
                                                at
                                                 12:00
                                            
                                            
                                                876
                                            
                                            
                                                English language
                                            
                                        

                                        
                                            
                                                Basildon
                                            
                                            
                                                06/05/2017 
                                                at
                                                 12:00
                                            
                                            
                                                1,134
                                            
                                            
                                                English language

Vitaly · Accepted Answer

I'm not familiar with the Beautiful Soup library, but judging from the code, on each iteration of the h2 loop you are traversing all the tr elements in the whole document. You should instead traverse only the rows that belong to the table associated with that specific h2 element.

Edit: After a quick look at the Beautiful Soup docs, it looks like you can use .next_sibling, since each h2 is always followed by its table, i.e. table = h2.next_sibling.next_sibling (called twice because the first sibling is a string containing whitespace). From the table you can then traverse all of its rows.
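
A minimal sketch of that approach, written for Python 3 with bs4. It uses h2.find_next_sibling('table') instead of two .next_sibling calls, so the whitespace string between the heading and the table does not matter. The sample markup, the rows_by_region helper, and the "Example Region" rows are made up for illustration; the real page's structure may differ, and on the real page you would still skip the leading h2 elements as in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup modelled on the excerpt in the question.
# "Example Region" and its row are invented placeholders, not real data.
SAMPLE_HTML = """
<h2>East Counties</h2>
<table>
  <tr><th>Local authority</th><th>Last update</th>
      <th>Number of businesses</th><th>Download</th></tr>
  <tr><td>Babergh</td><td>04/05/2017 at 12:00</td><td>876</td>
      <td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS297en-GB.xml">English language</a></td></tr>
  <tr><td>Basildon</td><td>06/05/2017 at 12:00</td><td>1,134</td>
      <td><a href="http://ratings.food.gov.uk/OpenDataFiles/FHRS109en-GB.xml">English language</a></td></tr>
</table>
<h2>Example Region</h2>
<table>
  <tr><th>Local authority</th><th>Last update</th>
      <th>Number of businesses</th><th>Download</th></tr>
  <tr><td>Example Town</td><td>01/01/2017 at 12:00</td><td>12</td>
      <td><a href="http://example.com/example.xml">English language</a></td></tr>
</table>
"""


def rows_by_region(html):
    """Return output lines: each region heading followed by its own rows."""
    soup = BeautifulSoup(html, 'html.parser')
    lines = []
    for h2 in soup.find_all('h2'):
        lines.append(h2.get_text(strip=True))
        # Only look at the table that follows this heading,
        # not at every tr in the document.
        table = h2.find_next_sibling('table')
        if table is None:
            continue
        for tr in table.find_all('tr'):
            tds = tr.find_all('td')
            link = tr.find('a', href=True)
            if len(tds) < 3 or link is None:
                continue  # header row (th cells only) or incomplete row
            name = tds[0].get_text(strip=True)
            count = int(tds[2].get_text(strip=True).replace(',', ''))
            lines.append('%s, %s, %d' % (name, link['href'], count))
    return lines


print('\n'.join(rows_by_region(SAMPLE_HTML)))
```

Note that find_next_sibling('table') returns the next table sibling even if another h2 sits in between, so this sketch assumes every heading really has its own table.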

You are getting duplicates for Wales because the source itself actually contains duplicates.
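
If those repeated rows are unwanted in the output file, one option is to filter them before writing. The sketch below is only an illustration: drop_duplicate_lines is a hypothetical helper, and the sample lines are placeholders rather than rows from the real page.

```python
def drop_duplicate_lines(lines):
    """Remove repeated lines while keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique


# Placeholder rows (not real data) with one exact repeat.
sample = [
    'Authority A, http://example.com/a.xml, 10',
    'Authority B, http://example.com/b.xml, 20',
    'Authority A, http://example.com/a.xml, 10',
]
print('\n'.join(drop_duplicate_lines(sample)))
```

This only removes lines that are identical character for character; rows that differ in any field are kept.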
