John Shambles
John Shambles

Reputation: 47

How to handle nested html tables with beautifulsoup?

I am loading an HTML file into a data frame using BeautifulSoup. The table that I am parsing contains a nested table in every row, and I'm not sure how to handle this as it's giving me an AssertionError...trying to load 4 columns when there are only 3 columns in the data frame.

Here is the beginning of the html table showing the headers and the first row of data:

<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
         <colgroup span="3"></colgroup>
         <tr style="background-color: #CCDDFF;" class="header">
            <td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
            <td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
            <td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
         </tr>
         <tr style="white-space: pre-wrap;background-color: #E4E4E4;">
            <td>BN100175-100861</td>
            <td>
               <table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
                  <tr>
                     <td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
                  </tr>
               </table>
            </td>
            <td>25.53825</td>
         </tr>

Below is the code that I wrote to read the data into a dataframe:

    table = soup.find('table', id = 'tableid1')
    table_rows = table.find_all('tr')

    allData=[]
    for tr in table_rows:
        td = tr.find_all('td')
        row = [i.text for i in td]
        allData.append(row)
     headers = allData.pop(0)
     self.d1_bundle_df = pd.DataFrame(allData, columns = headers)

When the above code is running, it generates the following error: AssertionError: 3 columns passed, passed data had 4 columns

What's the best way to handle these nested tables? This is still relatively new to me, so any direction would be greatly appreciated.

Upvotes: 1

Views: 70

Answers (1)

Andrej Kesely
Andrej Kesely

Reputation: 195438

Problem is you are searching in row for all <td>, but these <td> can contain other <td> in your case. One solution is use CSS selectors and search only for <td> which don't have other <td>:

data = '''<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
         <colgroup span="3"></colgroup>
         <tr style="background-color: #CCDDFF;" class="header">
            <td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
            <td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
            <td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
         </tr>
         <tr style="white-space: pre-wrap;background-color: #E4E4E4;">
            <td>BN100175-100861</td>
            <td>
               <table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
                  <tr>
                     <td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
                  </tr>
               </table>
            </td>
            <td>25.53825</td>
         </tr>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

rows = []
for tr in soup.select('#tableid1 > tr'):
    rows.append([td.get_text(strip=True) for td in tr.select('td:not(:has(td))')])

from pprint import pprint
pprint(rows)

Prints:

[['Bundle Name', 'Insulation Name / Layer / Layer PN', 'Bundle Width'],
 ['BN100175-100861', 'B29* / 10 / POLYETHYLENE_CONDUIT', '25.53825']]

The CSS selector #tableid1 > tr will search for all <tr> that are directly under the tag with id=tableid1

The CSS selector td:not(:has(td)) will search for all <td> that don't contain other <td>.

Further reading:

CSS Selectors Reference

Upvotes: 1

Related Questions