Reputation: 483
Summary: BS4 isn't picking up the contents of some td elements, returning None instead of the data they contain.
Detail: I'm trying to scrape an HTML table using BS4 (code below). The table has multiple columns, like this:
<tr>
<td><b>EICHERMOT</b></td>
<td>28-Mar-18</td>
<td>28,079.75</td>
<td><span class="gr_11" style="color:#0F6C02">0.45</span></td>
<td><span class="gr_11" style="color:#0F6C02">0.00%</span></td>
<td>28,560.00<br>
28,027.05</td>
<td>28298.25</td>
<td>49,050<br>
1,962</td>
<td>13,880.29</td>
<td>197,375</td>
<td><span class="gr_11" style="color:#0F6C02">750<br>
0.38%</span></td>
</tr>
The code that I'm using to scrape the table:
from bs4 import BeautifulSoup
page = open("topGainers.html")
soup = BeautifulSoup(page, "lxml")
page.close()
print(soup('table')[1].findAll('tr')[i].findAll('td')[5].string)
# None
The problem here is that when I run this code, the td containing br tags returns None.
I know that this is because it has more than one child, but I'm not able to resolve the issue. Using .text instead of .string returns something like this:
[u'28,560.00', <br/>, u'\n\t\t\t\t\t\t\t\t28,027.05']
Expected output:
[u'28,560.00 28,027.05']
How should I go about this?
Upvotes: 1
Views: 693
Reputation: 7248
To strip the extra whitespace from the text, you can simply use .get_text(' ', strip=True), which strips each piece of text inside the tag and joins the pieces with a single space:
from bs4 import BeautifulSoup
html = '''<tr>
<td><b>EICHERMOT</b></td>
<td>28-Mar-18</td>
<td>28,079.75</td>
<td><span class="gr_11" style="color:#0F6C02">0.45</span></td>
<td><span class="gr_11" style="color:#0F6C02">0.00%</span></td>
<td>28,560.00<br>
28,027.05</td>
<td>28298.25</td>
<td>49,050<br>
1,962</td>
<td>13,880.29</td>
<td>197,375</td>
<td><span class="gr_11" style="color:#0F6C02">750<br>
0.38%</span></td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('td')[5].get_text(' ', strip=True))
# 28,560.00 28,027.05
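If you want the whole row rather than a single cell, the same call can be applied to every td. The snippet below is only a rough sketch: it uses a shortened copy of the row from the question, and it assumes the built-in "html.parser" instead of "lxml" so that it runs without an extra dependency. It also shows why .string gave None in the first place: a tag with more than one child (text, br, text) has no single string to return.

```python
from bs4 import BeautifulSoup

# Shortened copy of the row from the question (illustration only)
row_html = """<tr>
<td><b>EICHERMOT</b></td>
<td>28,560.00<br>
28,027.05</td>
<td>28298.25</td>
</tr>"""

# Assumption: "html.parser" is used here instead of "lxml"
soup = BeautifulSoup(row_html, "html.parser")
cells = soup.find_all("td")

# A td with a single (possibly nested) child still has a .string
single_child = cells[0].string

# A td holding text, <br>, text has three children, so .string is None
multi_child = cells[1].string

# get_text(' ', strip=True) strips each text piece and joins them with a space
row_values = [td.get_text(" ", strip=True) for td in cells]
print(row_values)
```

Cells that contain a br come back as one space-separated string, so you can split on the space afterwards if you need the two numbers separately.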
Upvotes: 1