prokaryote

Reputation: 437

python beautifulsoup loop through table rows by section

I'm new to BeautifulSoup and Python, and I'm pretty sure this is a dead-simple problem, but I can't seem to get anywhere solving it.

I'm trying to loop through the rows of an HTML table, based on "header" rows that group the table by type of candy. My table looks like this: [image of the table]

I want the loop to get the data under each candy heading. So the iterations would get data like this:

first loop iteration: candy_type: kitkat, location: Mall 1, Planned: 63, Actual: 0, Diff: 25

second iteration: candy_type: kitkat, location: Mall 2, Planned: 7, Actual: 0, Diff: 6

... last iteration: candy_type: Skittles, location: Building 2, Planned: 320, Actual: 236, Diff: 0

This is the table code:

<TABLE BORDER="1" WIDTH="100%">
   <TR>
      <TH COLSPAN=4>Candy</TH>
   </TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4>
         <FONT  FACE=Arial>
            <center><b>KitKat</b></center>
         </FONT>
      </TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
   </TR>
   <TR>
      <TD>Mall 1</TD>
      <TD>63</TD>
      <TD>0</TD>
      <TD>25</TD>
   </TR>
   <TR>
      <TD>Mall 2</TD>
      <TD>7</TD>
      <TD>0</TD>
      <TD>6</TD>
   </TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4>
         <FONT  FACE=Arial>
            <center><b>OH Henry</b></center>
         </FONT>
      </TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
   </TR>
   <TR>
      <TD>Warehouse 1</TD>
      <TD>195</TD>
      <TD>122</TD>
      <TD>30</TD>
   </TR>
   <TR>
      <TD>Warehouse 2</TD>
      <TD>96</TD>
      <TD>76</TD>
      <TD>6</TD>
   </TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4>
         <FONT  FACE=Arial>
            <center><b>Skittles</b></center>
         </FONT>
      </TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>LOCATION</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>PLANNED</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>ACTUAL</FONT></TD>
      <TD><FONT  COLOR=White FACE=Arial SIZE=-2>DIFF</FONT></TD>
   </TR>
   <TR>
      <TD>Building 1</TD>
      <TD>120</TD>
      <TD>90</TD>
      <TD>5</TD>
   </TR>
   <TR>
      <TD>Building 2</TD>
      <TD>320</TD>
      <TD>236</TD>
      <TD>0</TD>
   </TR>
</TABLE> 

So I tried:

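from bs4 import BeautifulSoup

# test.html is a local file, so plain open() is enough to read it
with open('test.html') as f:
    soup = BeautifulSoup(f.read())  # no parser named; bs4 picks one that is installed

# The candy "header" rows are the ones with the light-blue background
candy_rows = soup.find_all('tr', {"bgcolor": "#CEE3F6"})
for row in candy_rows:
    print(row)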

This prints out the three candy types like this:

<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>KitKat</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>OH Henry</b></center>
</td>
</tr>
<tr bgcolor="#CEE3F6">
<td colspan="4">
<font face="Arial">
</font><center><b>Skittles</b></center>
</td>
</tr>

I thought I could group the candy "headers" (i.e. the tr elements whose bgcolor is set to #CEE3F6) and then iterate on that basis, but I cannot figure out how to get from each header to the data rows beneath it.

Any ideas?

Upvotes: 0

Views: 1954

Answers (1)

Bill Bell

Reputation: 21643

Find all the rows, then iterate through them. When you reach a row that holds the name of a candy (identified by the colour of the row, #CEE3F6), keep that name and walk through the row's next siblings. Skip the first sibling, which is the column-heading row, and capture the text of the td elements in each sibling after it. You know you've passed the last row for that candy when you encounter the name of a different candy (again identified by the colour of the row).

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('justTable.htm').read(), 'lxml')
>>> trs = soup.find_all('tr')
>>> for tr in trs:
...     if tr.get('bgcolor') == '#CEE3F6':  # candy-name row
...         candy = tr.text.strip()
...         first = True
...         for sibs in tr.find_next_siblings():
...             if first:  # skip the column-heading row
...                 first = False
...                 continue
...             if sibs.get('bgcolor') == '#CEE3F6':  # next candy starts here
...                 break
...             [candy]+sibs.text.strip().split('\n')
... 
['KitKat', 'Mall 1', '63', '0', '25']
['KitKat', 'Mall 2', '7', '0', '6']
['OH Henry', 'Warehouse 1', '195', '122', '30']
['OH Henry', 'Warehouse 2', '96', '76', '6']
['Skittles', 'Building 1', '120', '90', '5']
['Skittles', 'Building 2', '320', '236', '0']
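
If you would rather have one dict per row, as in the question's "candy_type: ..., location: ..." listing, the same idea can be done in a single pass over the rows, without looking at siblings. This is only a sketch under a few assumptions: the HTML string is a trimmed copy of the table from the question, the function name parse_candy_rows and the dict keys are made up for illustration, and it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumption: same row colours).
HTML = """
<TABLE BORDER="1" WIDTH="100%">
   <TR><TH COLSPAN=4>Candy</TH></TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4><FONT FACE=Arial><center><b>KitKat</b></center></FONT></TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD>
   </TR>
   <TR><TD>Mall 1</TD><TD>63</TD><TD>0</TD><TD>25</TD></TR>
   <TR><TD>Mall 2</TD><TD>7</TD><TD>0</TD><TD>6</TD></TR>
   <TR BGCOLOR=#CEE3F6>
      <TD COLSPAN=4><FONT FACE=Arial><center><b>Skittles</b></center></FONT></TD>
   </TR>
   <TR BGCOLOR=#336699>
      <TD>LOCATION</TD><TD>PLANNED</TD><TD>ACTUAL</TD><TD>DIFF</TD>
   </TR>
   <TR><TD>Building 2</TD><TD>320</TD><TD>236</TD><TD>0</TD></TR>
</TABLE>
"""

CANDY_COLOR = "#CEE3F6"    # background of the candy-name rows
HEADING_COLOR = "#336699"  # background of the LOCATION/PLANNED/... rows
KEYS = ("candy_type", "location", "planned", "actual", "diff")


def parse_candy_rows(html):
    """Return one dict per data row, tagged with the candy it belongs to."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    candy = None  # name from the most recent candy-name row
    for tr in soup.find_all("tr"):
        color = tr.get("bgcolor")
        if color == CANDY_COLOR:
            candy = tr.get_text(strip=True)
            continue
        if color == HEADING_COLOR or candy is None:
            continue  # column headings, or rows before the first candy
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) == 4:
            rows.append(dict(zip(KEYS, [candy] + cells)))
    return rows


rows = parse_candy_rows(HTML)
for row in rows:
    print(row)
```

All values come back as strings; convert planned, actual and diff with int() if you need numbers. Reading each td separately also avoids relying on newlines between the cells, which the split('\n') approach depends on.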

Upvotes: 1
