Reputation: 437
I'm new to BeautifulSoup and Python, and I'm pretty sure this is a dead-simple problem, but I can't seem to get anywhere solving it.
I'm trying to loop through rows of an html table, based on "header" rows that group the table by types of candy. My table looks like this:
I want the loop to get the data under each candy heading. So the iterations would get data like this:
first loop iteration: candy_type: kitkat, location: Mall 1, Planned: 63, Actual: 0, Diff: 25
second iteration: candy_type: kitkat, location: Mall 2, Planned: 7, Actual: 0, Diff: 6
... last iteration: candy_type: Skittles, location: Building 2, Planned: 320, Actual: 236, Diff: 0
This is the table code:
<TABLE BORDER="1" WIDTH="100%">
<TR>
<TH COLSPAN=4>Candy</TH>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>KitKat</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Mall 1</TD>
<TD>63</TD>
<TD>0</TD>
<TD>25</TD>
</TR>
<TR>
<TD>Mall 2</TD>
<TD>7</TD>
<TD>0</TD>
<TD>6</TD>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>OH Henry</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Warehouse 1</TD>
<TD>195</TD>
<TD>122</TD>
<TD>30</TD>
</TR>
<TR>
<TD>Warehouse 2</TD>
<TD>96</TD>
<TD>76</TD>
<TD>6</TD>
</TR>
<TR BGCOLOR=#CEE3F6>
<TD COLSPAN=4>
<FONT FACE=Arial>
<center><b>Skittles</b></center>
</FONT>
</TD>
</TR>
<TR BGCOLOR=#336699>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
<TD><FONT COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
</TR>
<TR>
<TD>Building 1</TD>
<TD>120</TD>
<TD>90</TD>
<TD>5</TD>
</TR>
<TR>
<TD>Building 2</TD>
<TD>320</TD>
<TD>236</TD>
<TD>0</TD>
</TR>
</TABLE>
So I tried:
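from bs4 import BeautifulSoup

# Read the local HTML file and parse it, naming the parser explicitly
with open('test.html') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

# The candy "header" rows are the tr elements with bgcolor #CEE3F6
candytype = soup.find_all('tr', {'bgcolor': '#CEE3F6'})
for row in candytype:
    print(row)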
This prints out the three candy types like this:
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>KitKat</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>OH Henry</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>Skittles</b></center>
</td>
</tr>
I thought I could group the candy "headers" (i.e. the tr elements whose bgcolor is set to #CEE3F6) and then iterate on that basis, but I cannot figure out how to get further into the data.
Any ideas?
Upvotes: 0
Views: 1954
Reputation: 21643
Find all the rows, then iterate through them. When you find one that contains the name of a candy (identified by the colour of the row), keep that name. Then walk through the next siblings of that row. Skip the first one, which is the column-heading row, and capture the text of the td elements in each subsequent row. You know you've passed the last row for that candy when you encounter the name of a different candy (again identified by the colour of the row).
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('justTable.htm').read(), 'lxml')
>>> trs = soup.find_all('tr')
>>> for tr in trs:
...     # A row with this background colour holds a candy name
...     if 'bgcolor' in tr.attrs and tr.attrs['bgcolor']=='#CEE3F6':
...         candy = tr.text.strip()
...         first = True
...         for sibs in tr.find_next_siblings():
...             # Skip the LOCATION/PLANNED/ACTUAL/DIFF heading row
...             if first:
...                 first = False
...                 continue
...             # The next candy name ends the current group
...             if 'bgcolor' in sibs.attrs and sibs.attrs['bgcolor']=='#CEE3F6':
...                 break
...             [candy]+sibs.text.strip().split('\n')
...
['KitKat', 'Mall 1', '63', '0', '25']
['KitKat', 'Mall 2', '7', '0', '6']
['OH Henry', 'Warehouse 1', '195', '122', '30']
['OH Henry', 'Warehouse 2', '96', '76', '6']
['Skittles', 'Building 1', '120', '90', '5']
['Skittles', 'Building 2', '320', '236', '0']
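If you'd rather have this as a script that returns one record per row, with the field names from the question, here is a minimal sketch. It is a single-pass variant of the idea above: instead of walking siblings, it remembers the most recent candy name while looping over all rows. The function name `parse_candy_rows`, the two colour constants, the `SAMPLE` string (a trimmed copy of the table), and the conversion of the numbers to `int` are my own assumptions for illustration. I also used the built-in `html.parser` here only so the sketch doesn't depend on lxml being installed.

```python
from bs4 import BeautifulSoup

# Assumed constants, taken from the bgcolor values in the question's table
HEADER_COLOR = "#CEE3F6"  # rows that hold a candy name
LABEL_COLOR = "#336699"   # LOCATION / PLANNED / ACTUAL / DIFF rows

# Trimmed copy of the question's table, for illustration only
SAMPLE = """
<TABLE BORDER="1" WIDTH="100%">
<TR><TH COLSPAN=4>Candy</TH></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>KitKat</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Mall 1</TD><TD>63</TD><TD>0</TD><TD>25</TD></TR>
<TR><TD>Mall 2</TD><TD>7</TD><TD>0</TD><TD>6</TD></TR>
<TR BGCOLOR=#CEE3F6><TD COLSPAN=4><FONT FACE=Arial><center><b>Skittles</b></center></FONT></TD></TR>
<TR BGCOLOR=#336699><TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD></TR>
<TR><TD>Building 1</TD><TD>120</TD><TD>90</TD><TD>5</TD></TR>
<TR><TD>Building 2</TD><TD>320</TD><TD>236</TD><TD>0</TD></TR>
</TABLE>
"""


def parse_candy_rows(html):
    """Return one dict per data row, tagged with the candy it belongs to."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    candy = None  # most recent candy name seen so far
    for tr in soup.find_all("tr"):
        color = tr.get("bgcolor")
        if color == HEADER_COLOR:
            # Candy heading row: remember the name for the rows that follow
            candy = tr.get_text(strip=True)
            continue
        if color == LABEL_COLOR or candy is None:
            # Column-label row, or a row before the first candy heading
            continue
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) != 4:
            continue  # not a location/planned/actual/diff row
        location, planned, actual, diff = cells
        records.append({
            "candy_type": candy,
            "location": location,
            "planned": int(planned),
            "actual": int(actual),
            "diff": int(diff),
        })
    return records


if __name__ == "__main__":
    for record in parse_candy_rows(SAMPLE):
        print(record)
```

Each dict corresponds to one "loop iteration" in the question, so you can iterate over the returned list directly. If your real table has non-numeric cells, drop the `int()` calls and keep the strings.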
Upvotes: 1