Reputation: 47
I am loading an HTML file into a data frame using BeautifulSoup. The table that I am parsing contains a nested table in every row, and I'm not sure how to handle this as it's giving me an AssertionError...trying to load 4 columns when there are only 3 columns in the data frame.
Here is the beginning of the html table showing the headers and the first row of data:
<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>
Below is the code that I wrote to read the data into a dataframe:
table = soup.find('table', id = 'tableid1')
table_rows = table.find_all('tr')
allData=[]
for tr in table_rows:
td = tr.find_all('td')
row = [i.text for i in td]
allData.append(row)
headers = allData.pop(0)
self.d1_bundle_df = pd.DataFrame(allData, columns = headers)
When the above code is running, it generates the following error: AssertionError: 3 columns passed, passed data had 4 columns
What's the best way to handle these nested tables? This is still relatively new to me, so any direction would be greatly appreciated.
Upvotes: 1
Views: 70
Reputation: 195438
Problem is you are searching in row for all <td>
, but these <td>
can contain other <td>
in your case. One solution is use CSS selectors and search only for <td>
which don't have other <td>
:
data = '''<table border="0" cellpadding="0" cellspacing="0" width="99%" style="font-family:Helvetica;font-size:12" id="tableid1">
<colgroup span="3"></colgroup>
<tr style="background-color: #CCDDFF;" class="header">
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Name</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Insulation Name / Layer / Layer PN</td>
<td style="vertical-align:top;text-align:left; padding: 0px; font-weight: bold; " width="33%">Bundle Width</td>
</tr>
<tr style="white-space: pre-wrap;background-color: #E4E4E4;">
<td>BN100175-100861</td>
<td>
<table border="0" cellpadding="0" cellspacing="0" style="font-family:Helvetica;font-size:12">
<tr>
<td>B29* / 10 / POLYETHYLENE_CONDUIT</td>
</tr>
</table>
</td>
<td>25.53825</td>
</tr>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
rows = []
for tr in soup.select('#tableid1 > tr'):
rows.append([td.get_text(strip=True) for td in tr.select('td:not(:has(td))')])
from pprint import pprint
pprint(rows)
Prints:
[['Bundle Name', 'Insulation Name / Layer / Layer PN', 'Bundle Width'],
['BN100175-100861', 'B29* / 10 / POLYETHYLENE_CONDUIT', '25.53825']]
The CSS selector #tableid1 > tr
will search for all <tr>
that are directly under the tag with id=tableid1
The CSS selector td:not(:has(td))
will search for all <td>
that don't contain other <td>
.
Further reading:
Upvotes: 1