Python web scraping table with sub headings

Question

I am trying to extract some information from a table that appears on various webpages (my apologies for not disclosing the webpage).


 
First sub-class
Info1
Object 1
Info 2
Object 2
Info 3
Info 4
Info 5

Second sub-class
Info11
Object 11
Info 22
Object 22
Info 33
Info 44
Info 55

Third sub-class
Info 111
Object 111
Info 222
Object 222
Info 333
Info 444
Info 555

The table essentially looks like the following:

Image 1

The problem is that both the sub-classes and the number of rows for each sub-class may change. For example, the first sub-class may have 1 item, the second sub-class 3 items, and the third sub-class 2 items. Additionally, I may get a table with only sub-classes 1 and 2.

For example:

Image 2

OR

Image 3

are also possible.

I want the sub-class value to appear alongside each of its info rows, in the following format (based on Image 1):

Image 4

However, I am a little stuck on how to achieve this in Python, since the sub-class headings are not separate containers under which each row item appears. I can load a webpage using a web driver and parse the page source with Beautiful Soup. However, I could not figure out how to assign the sub-classes to the rows (especially since the info rows are not child elements of the sub-class row but simply further rows of the same table).

As of now, I can use .find_all('tr') to get all the rows of the tables. However, since the sub-classes and the number of rows are not uniform across the 500 or so tables, I cannot work out how to extract the data. Any help would be appreciated.


Josep Valls · Accepted Answer

Just process your HTML row by row:

import bs4

b = bs4.BeautifulSoup(html, 'html.parser')
data = {}
current = None
for row in b.find_all('tr'):
    if row.find_all('th'):
        # this is a header: remember it for the rows that follow
        current = row.find_all('th')[0].text
    else:
        # this is not a header, therefore it is data under the last header seen;
        # append so that earlier rows under the same header are not overwritten
        data.setdefault(current, []).append(row.find_all('td'))  # do whatever processing you need here; you didn't specify

If you need to preserve the order of your headers (a dict only guarantees insertion order from Python 3.7 onward), use a list of lists instead of a dict:

data = []
headers = []

for row in b.find_all('tr'):
    if row.find_all('th'):
        # this is a header
        headers.append(row.find_all('th')[0].text)
        data.append([])
    else:
        # this is not a header, therefore is data under the last header seen
        data[-1].append(row.find_all('td'))
print(list(zip(headers, data)))
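
To get the flat layout the question asks for (each info row carrying its sub-class), the same row-by-row idea can emit one flat row per data row. This is only a sketch: the real pages were not disclosed, so the sample markup, the assumption that sub-class rows use th cells while info rows use td cells, and the helper name flatten_table are all made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: assumes sub-class rows use <th> and info rows use <td>.
SAMPLE_HTML = """
<table>
  <tr><th colspan="2">First sub-class</th></tr>
  <tr><td>Info 1</td><td>Object 1</td></tr>
  <tr><td>Info 2</td><td>Object 2</td></tr>
  <tr><th colspan="2">Second sub-class</th></tr>
  <tr><td>Info 11</td><td>Object 11</td></tr>
</table>
"""


def flatten_table(html):
    """Return one list per data row, prefixed with its sub-class heading."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    current = None  # last sub-class heading seen; None before the first one
    for tr in soup.find_all("tr"):
        header = tr.find("th")
        if header is not None:
            # Sub-class row: remember its text for the rows that follow.
            current = header.get_text(strip=True)
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that contain no <td> cells at all
            rows.append([current] + cells)
    return rows


rows = flatten_table(SAMPLE_HTML)
for row in rows:
    print(row)
```

Because the sub-class is carried along in a variable, it does not matter how many rows each sub-class has or whether a sub-class is missing from a given table. If a page holds several tables, calling find_all('tr') on the specific table element rather than on the whole document should keep their rows from mixing.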
