How to handle nested HTML tables with BeautifulSoup?

Question

I am loading an HTML file into a data frame using BeautifulSoup. The table that I am parsing contains a nested table in every row, and I'm not sure how to handle this: it raises an AssertionError because each row yields 4 values when the data frame has only 3 columns.

The beginning of the HTML table, with the headers and the first row of data, is reproduced at the end of this question.

Below is the code that I wrote to read the data into a dataframe:

    table = soup.find('table', id='tableid1')
    table_rows = table.find_all('tr')

    allData = []
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        allData.append(row)
    headers = allData.pop(0)
    self.d1_bundle_df = pd.DataFrame(allData, columns=headers)

When the above code runs, it raises the following error:

    AssertionError: 3 columns passed, passed data had 4 columns

What's the best way to handle these nested tables? This is still relatively new to me, so any direction would be greatly appreciated.

         
         
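Table excerpt (header row, then the first data row; the middle cell of the data row contains the nested table):

    Bundle Name | Insulation Name / Layer / Layer PN | Bundle Width
    BN100175-100861 | [nested table: B29* / 10 / POLYETHYLENE_CONDUIT] | 25.53825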

Andrej Kesely · Accepted Answer

The problem is that you are searching each row for all <td> tags, but in your case these can contain other <td> tags. One solution is to use CSS selectors and select only the <td> tags that don't contain other <td> tags:

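from pprint import pprint

from bs4 import BeautifulSoup

# data holds the HTML table from the question
data = ''''''

soup = BeautifulSoup(data, 'lxml')

rows = []
# iterate only over the <tr> tags directly under the outer table
for tr in soup.select('#tableid1 > tr'):
    # keep only the <td> cells that do not contain another <td>
    rows.append([td.get_text(strip=True) for td in tr.select('td:not(:has(td))')])

pprint(rows)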

Prints:

[['Bundle Name', 'Insulation Name / Layer / Layer PN', 'Bundle Width'],
 ['BN100175-100861', 'B29* / 10 / POLYETHYLENE_CONDUIT', '25.53825']]

The CSS selector #tableid1 > tr matches every <tr> that is a direct child of the tag with id=tableid1, so the rows of the nested tables are skipped.
The CSS selector td:not(:has(td)) matches every <td> that doesn't contain another <td>.
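
To tie the two selectors back to the data frame from the question, here is a minimal, self-contained sketch. The markup in it is an assumption: the question's HTML is not available, so the html string is a hypothetical reconstruction built from the cell text (three outer <td> cells per row, with the middle cell of the data row wrapping a one-cell nested table). It also uses the built-in html.parser instead of lxml, only to avoid the extra dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup: an assumed reconstruction of the question's table,
# not the original HTML.
html = """
<table id="tableid1">
  <tr>
    <td>Bundle Name</td>
    <td>Insulation Name / Layer / Layer PN</td>
    <td>Bundle Width</td>
  </tr>
  <tr>
    <td>BN100175-100861</td>
    <td>
      <table>
        <tr>
          <td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
        </tr>
      </table>
    </td>
    <td>25.53825</td>
  </tr>
</table>
"""

# html.parser is swapped in for lxml here to keep the sketch dependency-light.
soup = BeautifulSoup(html, "html.parser")

# find_all('td') descends into the nested table, so the data row yields
# the three outer cells plus the inner one.
data_row = soup.select("#tableid1 > tr")[1]
naive_cells = data_row.find_all("td")

# Selecting only <td> cells without a nested <td> gives one value per column.
rows = [
    [td.get_text(strip=True) for td in tr.select("td:not(:has(td))")]
    for tr in soup.select("#tableid1 > tr")
]

# The first row holds the column names, as in the question's code.
headers = rows.pop(0)
df = pd.DataFrame(rows, columns=headers)

print(len(naive_cells), df.shape)
```

With this assumed markup, the naive search finds 4 cells in the data row while the selector-based rows have 3 values each, which matches the mismatch reported in the AssertionError. If the real table wraps its rows in <tbody>, the #tableid1 > tr selector would need adjusting accordingly.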

Further reading:

CSS Selectors Reference
