Reputation: 1769
I am trying to extract a table from a web page. Below are the HTML and my Python code, which uses BeautifulSoup. This code has always worked for me, but in this case I get blank output. Thanks in advance.
<table>
<thead>
<tr>
<th>Period Ending:</th>
<th class="TalignL">Trend</th>
<th>9/27/2014</th>
<th>9/28/2013</th>
<th>9/29/2012</th>
<th>9/24/2011</th>
</tr>
</thead>
<tr>
<th bgcolor="#E6E6E6">Total Revenue</th>
<td class="td_genTable"><table border="0" align="center" width="*" cellspacing="0" cellpadding="0"><tr><td align="bottom"><table border="0" height="100%" cellspacing="0" cellpadding="0"><tr><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="15" bgcolor="#47C3D3" width="6"></td><td height="15" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="1" bgcolor="#FFFFFF" width="6"></td><td height="1" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="14" bgcolor="#47C3D3" width="6"></td><td height="14" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="2" bgcolor="#FFFFFF" width="6"></td><td height="2" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="13" bgcolor="#47C3D3" width="6"></td><td height="13" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="7" bgcolor="#FFFFFF" width="6"></td><td height="7" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="8" bgcolor="#47C3D3" width="6"></td><td height="8" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="1" bgcolor="#D1D1D1"></td></tr></table></td></tr></table></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
<td>$156,508,000</td>
<td>$108,249,000</td>
rows = table.findAll('tr')
for row in rows:
    cols = row.findAll('td')
    col1 = [ele.text.strip().replace(',','') for ele in cols]
    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]
    record = (stock, account,period1,period3,period3)
    print record
Upvotes: 1
Views: 2395
Reputation: 474221
Adding to what @abarnert pointed out, I would get all the td elements whose text starts with $:
for row in soup.table.find_all('tr', recursive=False):
    record = [td.text.replace(",", "") for td in row.find_all("td", text=lambda x: x and x.startswith("$"))]
    print record
For the input you've provided, it prints:
[u'$182795000', u'$170910000', u'$156508000', u'$108249000']
which you can "unpack" into separate variables:
account, period1, period2, period3 = record
Note that I'm explicitly passing recursive=False to avoid going deeper in the tree and to get only the direct tr children of the table element.
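For readers on Python 3 with a recent bs4, the same filter can be sketched as follows. The trimmed-down HTML, the html.parser choice, and the helper name starts_with_dollar are assumptions made for illustration; string= is the newer name of the text= argument used above.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: one data row whose first td
# holds a nested layout table, followed by two dollar cells.
html = """<table><tr>
<th>Total Revenue</th>
<td><table><tr><td></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")


def starts_with_dollar(s):
    # The filter may be handed None for tags that have no single string.
    return s is not None and s.startswith("$")


records = []
# recursive=False keeps the loop on the outer table's own rows.
for row in soup.table.find_all("tr", recursive=False):
    cells = row.find_all("td", string=starts_with_dollar)
    records.append([td.get_text().replace(",", "") for td in cells])

print(records)  # [['$182795000', '$170910000']]
```

The cells of the nested table are empty, so the filter skips them even though this inner find_all call is still recursive.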
Upvotes: 2
Reputation: 366133
Your first problem is that find_all (or findAll, which is just a deprecated synonym for the same thing) doesn't just find the rows in the table; it finds the rows in the table and in every subtable within it. You almost certainly don't want to iterate over both kinds of rows and run the same code on each one. If you don't want that, as the docs for the recursive argument say, pass recursive=False.
So, now you get back only one row. If you do row.find_all('td'), that's going to have the same problem again: you're going to find all of the columns of this row, and all of the columns of every row in every subtable within one of those columns. Again, that's not what you want, so use recursive=False.
And now you get back only 5 columns. The first one is just a big table with a bunch of empty cells in it; the others, on the other hand, have dollar values in them, which seem to be the ones you want.
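To make the difference in counts concrete, here is a small sketch on a simplified stand-in for the markup. The HTML, the id="outer" attribute, and the html.parser choice are assumptions for illustration, not the asker's actual page.

```python
from bs4 import BeautifulSoup

# One outer data row; its first td contains a nested table with
# two rows and three empty cells, mimicking the "Trend" chart.
html = """
<table id="outer">
<tr>
<th>Total Revenue</th>
<td><table><tr><td></td><td></td></tr><tr><td></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="outer")

all_rows = table.find_all("tr")                      # also descends into the nested table
direct_rows = table.find_all("tr", recursive=False)  # direct children only

row = direct_rows[0]
all_cells = row.find_all("td")                       # outer cells plus nested cells
direct_cells = row.find_all("td", recursive=False)   # outer cells only

print(len(all_rows), len(direct_rows))    # 3 1
print(len(all_cells), len(direct_cells))  # 6 3
```

One caveat: some parsers (html5lib, for example) insert a tbody element, in which case the rows are no longer direct children of the table and the non-recursive search has to start from the tbody instead.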
So, just adding recursive=False to both calls, and setting stock to something (I don't know where it's supposed to come from in your code, but without it you're obviously going to just get a NameError):
stock = 'spam'
rows = table.find_all('tr', recursive=False)
for row in rows:
    cols = row.findAll('td', recursive=False)
    col1 = [ele.text.strip().replace(',','') for ele in cols]
    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]
    record = (stock, account,period1,period3,period3)
    print record
This will print:
('spam', [''], ['$170910000'], ['$108249000'], ['$108249000'])
I'm not sure why you used period3 twice and never used period2, why you skipped over column 1 entirely, or why you sliced 1-element lists instead of just indexing the values, but anyway, this seems to be what you were trying to do.
As a side note, if you actually want to break out the list into 5 values, rather than into 4 1-element lists skipping one of the values, you can write:
account, whatever, period1, period2, period3 = col1
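As a rough illustration of slicing versus indexing and of the unpacking above, here is a sketch that starts from the five cell texts the corrected loop produces for the sample row. The hard-coded list and the placeholder stock value 'spam' are assumptions for illustration.

```python
# Cell texts for the sample row after strip() and comma removal.
col1 = ['', '$182795000', '$170910000', '$156508000', '$108249000']

sliced = col1[2:3]   # a slice always yields a list, here with one element
indexed = col1[2]    # indexing yields the string itself

# Unpacking requires exactly five values; the names follow the answer,
# with "whatever" standing in for the column the original code skipped.
account, whatever, period1, period2, period3 = col1
record = ('spam', account, period1, period2, period3)

print(sliced)   # ['$170910000']
print(indexed)  # $170910000
print(record)   # ('spam', '', '$170910000', '$156508000', '$108249000')
```

If a row has more or fewer than five cells, the unpacking line raises a ValueError, which is often a useful signal that the row is not shaped as expected.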
Upvotes: 1