BS4 returning None from some rows

Question

Summary: BS4 isn't picking up the contents of some td elements, returning None instead of the data they contain.

Detail: I'm trying to scrape an HTML table using BS4 (code below). The table has multiple columns, like this:

The code I'm using to scrape the table:

from bs4 import BeautifulSoup

with open("topGainers.html") as page:
    soup = BeautifulSoup(page, "lxml")

# i is the index of the row being inspected
print(soup('table')[1].findAll('tr')[i].findAll('td')[5].string)
# None

The problem is that when I run this code, any td that contains br tags returns None. I know this is because the td has more than one child, but I haven't been able to resolve the issue. Using .text instead of .string returns something like this:

[u'28,560.00', 
<br/>, u'
								28,027.05']

Expected output:

[u'28,560.00 28,027.05']

How should I go about this?

Keyur Potdar · Accepted Answer

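To strip the extra whitespace and join the pieces of text, you can simply use .get_text(' ', strip=True). The first argument is the separator placed between the text fragments, and strip=True strips the whitespace from each fragment: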

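from bs4 import BeautifulSoup

html = '''
<tr>
    <td>EICHERMOT</td>
    <td>28-Mar-18</td>
    <td>28,079.75</td>
    <td>0.45</td>
    <td>0.00%</td>
    <td>28,560.00<br/>
    28,027.05</td>
    <td>28298.25</td>
    <td>49,050<br/>
    1,962</td>
    <td>13,880.29</td>
    <td>197,375</td>
    <td>750<br/>
        0.38%</td>
</tr>'''

soup = BeautifulSoup(html, 'lxml')
# The sixth cell holds two values separated by a br tag
print(soup.find_all('td')[5].get_text(' ', strip=True))
# 28,560.00 28,027.05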
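
If you need this for every cell rather than a single one, here is a rough sketch of one way to do it. It uses .stripped_strings, which yields each text fragment of a tag with the surrounding whitespace removed, so joining the fragments gives the same result as .get_text(' ', strip=True). The two-row table, the EXAMPLE ticker and the cell_text helper are made up for illustration and are not taken from your page; I used html.parser here only to keep the snippet free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table; the markup is an assumption for illustration only.
html = """
<table>
  <tr><td>EICHERMOT</td><td>28,560.00<br/>
      28,027.05</td></tr>
  <tr><td>EXAMPLE</td><td>750<br/>
      0.38%</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_text(td):
    """Join all text fragments of a cell with single spaces (hypothetical helper)."""
    return " ".join(td.stripped_strings)


cell = soup.find_all("td")[1]
# .string is None because the td has several children (text, br, text)
print(cell.string)
# Joining the stripped fragments gives one clean value
print(cell_text(cell))

# Apply the same helper to every cell of every row
rows = [[cell_text(td) for td in tr.find_all("td")] for tr in soup.find_all("tr")]
print(rows)
```

Cells with a single text child come out unchanged, so the same helper can be applied to the whole table without special-casing the ones that contain br tags.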
