Inderpartap Cheema

Reputation: 483

BS4 returning none from some rows

Summary: BS4 isn't picking up the contents of some td elements, returning None instead of the data they contain.

Detail: I'm trying to scrape an HTML table using BS4 (code below). The table has multiple columns, like this:

<tr>
    <td><b>EICHERMOT</b></td>
    <td>28-Mar-18</td>
    <td>28,079.75</td>
    <td><span class="gr_11" style="color:#0F6C02">0.45</span></td>
    <td><span class="gr_11" style="color:#0F6C02">0.00%</span></td>
    <td>28,560.00<br>
    28,027.05</td>
    <td>28298.25</td>
    <td>49,050<br>
    1,962</td>
    <td>13,880.29</td>
    <td>197,375</td>
    <td><span class="gr_11" style="color:#0F6C02">750<br>
        0.38%</span></td>
</tr>

The code that I'm using to scrape the table:

from bs4 import BeautifulSoup

with open("topGainers.html") as page:
    soup = BeautifulSoup(page, "lxml")

print(soup('table')[1].findAll('tr')[i].findAll('td')[5].string)
# None

The problem is that when I run this code, any td containing br tags returns None. I know this happens because the td has more than one child, but I haven't been able to resolve it. Using .text instead of .string returns something like this:

[u'28,560.00', <br/>, u'\n\t\t\t\t\t\t\t\t28,027.05']

Expected output:

[u'28,560.00 28,027.05']

How should I go about this?

Upvotes: 1

Views: 693

Answers (1)

Keyur Potdar

Reputation: 7248

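To get the cell's text with the extra whitespace stripped, you can simply use .get_text(' ', strip=True). The first argument is the separator placed between the text fragments, and strip=True trims the whitespace around each fragment: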

from bs4 import BeautifulSoup

html = '''<tr>
    <td><b>EICHERMOT</b></td>
    <td>28-Mar-18</td>
    <td>28,079.75</td>
    <td><span class="gr_11" style="color:#0F6C02">0.45</span></td>
    <td><span class="gr_11" style="color:#0F6C02">0.00%</span></td>
    <td>28,560.00<br>
    28,027.05</td>
    <td>28298.25</td>
    <td>49,050<br>
    1,962</td>
    <td>13,880.29</td>
    <td>197,375</td>
    <td><span class="gr_11" style="color:#0F6C02">750<br>
        0.38%</span></td>
</tr>'''

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('td')[5].get_text(' ', strip=True))
# 28,560.00 28,027.05
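
For context on why .string gives None here: it only returns a value when the tag has exactly one child, and a td with a br inside has several (text, br, text). The sketch below shows that, along with .stripped_strings as an alternative and the same get_text call applied to every cell in a row. The shortened sample row, the variable names, and the use of the built-in "html.parser" (chosen only so the snippet runs without lxml) are assumptions for illustration, not part of the original code.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical sample row based on the table in the question.
html = """<table><tr>
    <td><b>EICHERMOT</b></td>
    <td>28,560.00<br>
    28,027.05</td>
    <td>28298.25</td>
</tr></table>"""

# "html.parser" is an assumption made so the sketch has no lxml dependency.
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("td")

# The second cell has three children (text, <br/>, text), so .string is None.
multi_child_string = cells[1].string

# The third cell has a single text child, so .string returns that text.
single_child_string = cells[2].string

# .stripped_strings yields each text fragment with surrounding whitespace removed.
fragments = list(cells[1].stripped_strings)

# get_text(' ', strip=True) strips each fragment and joins them with a space.
row_values = [td.get_text(" ", strip=True) for td in cells]

print(multi_child_string)
print(fragments)
print(row_values)
```

If the two values are needed separately (for example, as high and low), the list from .stripped_strings may be more convenient than the joined string.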

Upvotes: 1
