Reputation: 693
I am trying to extract some information from a table which appears on various webpages (My apologies for not disclosing the webpage).
<table class="toccolours" style="font-size: 85%;">
<tbody><tr>
<th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>First sub-class</b></th>
</tr>
<tr>
<td style="padding-right:5px">Info1</td>
<td style="padding-right:5px;"><a title="Object 1">Object 1</a></td>
<td style="text-align:center;padding-right:5px">Info 2</td>
<td style="padding-right:5px"><a title="Object 2">Object 2</a></td>
<td style="padding-right:5px">Info 3</td>
<td style="text-align:center;">Info 4</td>
<td style="text-align:center;">Info 5</td>
<td></td>
</tr>
<tr>
<th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Second sub-class</b></th>
</tr>
<tr>
<td style="padding-right:5px">Info11</td>
<td style="padding-right:5px;"><a title="Object 11">Object 11</a></td>
<td style="text-align:center;padding-right:5px">Info 22</td>
<td style="padding-right:5px"><a title="Object 22">Object 22</a></td>
<td style="padding-right:5px">Info 33</td>
<td style="text-align:center;">Info 44</td>
<td style="text-align:center;">Info 55</td>
<td></td>
</tr>
<tr>
<th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Third sub-class</b></th>
</tr>
<tr>
<td style="padding-right:5px">Info 111</td>
<td style="padding-right:5px;"><a title="Object 111">Object 111</a></td>
<td style="text-align:center;padding-right:5px">Info 222</td>
<td style="padding-right:5px">Object 222</td>
<td style="padding-right:5px">Info 333</td>
<td style="text-align:center;">Info 444</td>
<td style="text-align:center;">Info 555</td>
<td></td>
</tr>
</tbody></table>
Where the table essentially looks like the following:
Image 1
The problem is that both the sub-classes and the number of rows per sub-class may change. For example, the first sub-class may have 1 item, the second sub-class 3 items, and the third sub-class 2 items. I may also get a table with only sub-classes 1 and 2.
For example:
Image 2
OR
Image 3
are also possible.
I want the sub-class value to appear alongside each of its info rows, in the following format (based on Image 1):
Image 4
However, I am a little stuck on how to achieve this in Python, since the table headings are not separate classes under which each row item appears. I can load a webpage using a web driver and parse the page source with Beautiful Soup. However, I could not figure out how to assign the sub-classes to the rows (especially since the info rows are not child elements of the sub-class row but simply subsequent rows of the table).
As of now, I can use .find_all('tr') to get all the rows of the tables. However, since the sub-classes and the number of rows are not uniform across the 500 or so tables, I cannot work out how to extract the data. Any help would be appreciated.
Upvotes: 0
Views: 1226
Reputation: 1859
I used lxml; I hope it fits your problem.
from lxml import etree
html_body = """
<table class="toccolours" style="font-size: 85%;">
<tbody>
<tr>
<th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>First sub-class</b></th>
</tr>
<tr>
<td style="padding-right:5px">Info1</td>
<td style="padding-right:5px;"><a title="Object 1">Object 1</a></td>
<td style="text-align:center;padding-right:5px">Info 2</td>
<td style="padding-right:5px"><a title="Object 2">Object 2</a></td>
<td style="padding-right:5px">Info 3</td>
<td style="text-align:center;">Info 4</td>
<td style="text-align:center;">Info 5</td>
<td></td>
</tr>
<tr>
<th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Second sub-class</b></th>
</tr>
<tr>
<td style="padding-right:5px">Info11</td>
<td style="padding-right:5px;"><a title="Object 11">Object 11</a></td>
<td style="text-align:center;padding-right:5px">Info 22</td>
<td style="padding-right:5px"><a title="Object 22">Object 22</a></td>
<td style="padding-right:5px">Info 33</td>
<td style="text-align:center;">Info 44</td>
<td style="text-align:center;">Info 55</td>
<td></td>
</tr>
<tr>
<th colspan="8" style="background-color: #ccf; color: #000080; text-align:center;"><b>Third sub-class</b></th>
</tr>
<tr>
<td style="padding-right:5px">Info 111</td>
<td style="padding-right:5px;"><a title="Object 111">Object 111</a></td>
<td style="text-align:center;padding-right:5px">Info 222</td>
<td style="padding-right:5px">Object 222</td>
<td style="padding-right:5px">Info 333</td>
<td style="text-align:center;">Info 444</td>
<td style="text-align:center;">Info 555</td>
<td></td>
</tr>
</tbody></table>
"""
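tableData = {}
tree = etree.fromstring(html_body, parser=etree.HTMLParser())
# Each sub-class header is a <th colspan=...> inside its own <tr>.
for i in tree.xpath("//tr/th[@colspan]"):
    # The header text sits in the <b> child of the <th>.
    className = i.getchildren()[0].text
    tableData[className] = []
    parentTag = i.getparent()
    # Note: this reads only the single row directly after the header.
    tableBody = parentTag.getnext().xpath('td')
    for cell in tableBody:
        if cell.text:
            tableData[className].append(cell.text)
        else:
            # No direct text: fall back to the first child, e.g. an <a> tag.
            child_tag = cell.getchildren()
            if child_tag:
                tableData[className].append(child_tag[0].text)
print(tableData)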
Output:
>>> {'Second sub-class': ['Info11', 'Object 11', 'Info 22', 'Object 22', 'Info 33', 'Info 44', 'Info 55'], 'First sub-class': ['Info1', 'Object 1', 'Info 2', 'Object 2', 'Info 3', 'Info 4', 'Info 5'], 'Third sub-class': ['Info 111', 'Object 111', 'Info 222', 'Object 222', 'Info 333', 'Info 444', 'Info 555']}
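If a sub-class can own several rows, as the question describes, the lookup of the next row can be turned into a loop that keeps calling `getnext()` until it reaches the next header row. A minimal sketch, assuming the same header/data row layout; `group_rows` and the shortened sample are illustrative names, not part of the original answer:

```python
from lxml import etree

# Shortened, hypothetical sample: the second sub-class has two data rows.
sample = """
<table>
<tr><th colspan="8"><b>First sub-class</b></th></tr>
<tr><td>Info 1</td><td><a title="Object 1">Object 1</a></td><td></td></tr>
<tr><th colspan="8"><b>Second sub-class</b></th></tr>
<tr><td>Info 11</td><td><a title="Object 11">Object 11</a></td><td></td></tr>
<tr><td>Info 12</td><td>Object 12</td><td></td></tr>
</table>
"""

def group_rows(html_text):
    """Map each header text to the list of data rows that follow it."""
    tree = etree.fromstring(html_text, parser=etree.HTMLParser())
    grouped = {}
    for th in tree.xpath("//tr/th[@colspan]"):
        name = th.xpath("string()").strip()
        rows = grouped.setdefault(name, [])
        row = th.getparent().getnext()
        # Stop at the end of the table or at the next header row.
        while row is not None and not row.xpath("th"):
            cells = [td.xpath("string()").strip() for td in row.xpath("td")]
            rows.append([c for c in cells if c])  # drop empty cells
            row = row.getnext()
    return grouped

grouped = group_rows(sample)
print(grouped)
```

Each value is a list of rows rather than one flat list, so the row boundaries survive when a sub-class has more than one row.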
Upvotes: 1
Reputation: 5560
Just process your HTML row by row:
import bs4

b = bs4.BeautifulSoup(html, 'html.parser')
data = {}
current = None
for row in b.find_all('tr'):
    if row.find_all('th'):
        # this is a header
        current = row.find_all('th')[0].text
    else:
        # this is not a header, therefore it is data under the last header seen;
        # append so that several rows under one header are all kept
        data.setdefault(current, []).append(row.find_all('td'))  # do whatever processing you need here; you didn't specify
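`row.find_all('td')` returns Tag objects rather than strings. As one possible form of the processing step, each cell can be reduced to its visible text with `get_text(strip=True)`, which also flattens the nested `<a>` tags. This is only a sketch; `rows_by_header` and the shortened sample are made-up names:

```python
import bs4

# Shortened, hypothetical sample with two rows under one header.
html = """
<table>
<tr><th colspan="8"><b>First sub-class</b></th></tr>
<tr><td>Info 1</td><td><a title="Object 1">Object 1</a></td><td></td></tr>
<tr><td>Info 2</td><td>Object 2</td><td></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
rows_by_header = {}
current = None
for row in soup.find_all("tr"):
    header = row.find("th")
    if header is not None:
        # Header row: start a new group.
        current = header.get_text(strip=True)
        rows_by_header.setdefault(current, [])
    elif current is not None:
        # Data row: keep the text of every non-empty cell.
        texts = [td.get_text(strip=True) for td in row.find_all("td")]
        rows_by_header[current].append([t for t in texts if t])

print(rows_by_header)
```

Rows that appear before any header are skipped here; whether that is the right choice depends on the real pages.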
If you need to preserve the order of your headers (plain dicts only guarantee insertion order from Python 3.7 onward), use a list of lists instead of a dict:
data = []
headers = []
for row in b.find_all('tr'):
    if row.find_all('th'):
        # this is a header
        headers.append(row.find_all('th')[0].text)
        data.append([])
    else:
        # this is not a header, therefore it is data under the last header seen
        data[-1].append(row.find_all('td'))
print(list(zip(headers, data)))
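To reach the layout the question asks for, with the sub-class repeated next to each of its rows, the `headers` and `data` lists can be flattened. A small sketch, assuming the cells have already been converted to strings; the sample values and the name `flatten` are illustrative:

```python
import csv
import io

# Hypothetical values in the headers/data shape, with cell text already extracted.
headers = ["First sub-class", "Second sub-class"]
data = [
    [["Info 1", "Object 1"]],
    [["Info 11", "Object 11"], ["Info 12", "Object 12"]],
]

def flatten(headers, data):
    """Return one list per data row, prefixed with its sub-class name."""
    return [[name] + cells for name, rows in zip(headers, data) for cells in rows]

flat = flatten(headers, data)

# Write the rows as CSV to an in-memory buffer (a file object works the same way).
buffer = io.StringIO()
csv.writer(buffer).writerows(flat)
print(buffer.getvalue())
```

Because every output row carries its own sub-class, the rows from all 500 or so tables could be appended to one file without losing track of where each came from.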
Upvotes: 1