Reputation: 105

Scraping oddly formatted numbers using BeautifulSoup

I'm trying to scrape an HTML table using BS4 in Python, but numbers formatted like 247 759 384 (read as 247759384) in the HTML come out differently in Python. I would like to output them as they appear in the table.

temp = []
a = soup.find_all('tr')[1]  # second row of the table
for td in a.find_all('td'):
    temp.append(td.text)  # .text is already a str
a.find_all('td')[10].text

gives me an output of

'24\xa0081\xa0728'

instead of

24081728

Upvotes: 0

Answers (2)

DaPanda

Reputation: 319

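The \xa0 is a non-breaking space (U+00A0) in Unicode. The table uses it as a thousands separator, and Python shows it as an escape sequence in the string's repr.

A quick way to fix it is to replace it with an ASCII space (using, say, text.replace('\xa0', ' ')), or with an empty string if you want the digits joined into 24081728. But be careful, there may be other encoding issues you haven't spotted yet.

I'd recommend reading this first and handling it in a cleaner way.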
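As a minimal sketch of the replacement approach, assuming the only stray character is \xa0 and the cell holds a whole number (the helper name strip_nbsp is made up for illustration):

```python
def strip_nbsp(text: str) -> str:
    """Remove non-breaking spaces (U+00A0) used as thousands separators."""
    return text.replace("\xa0", "")


raw = "24\xa0081\xa0728"      # what td.text returned in the question
cleaned = strip_nbsp(raw)     # digits only, as a string
number = int(cleaned)         # convert once the separators are gone

print(cleaned)
print(number)
```

If the table also uses ordinary spaces or other separators, this helper leaves them in place and int() will raise a ValueError, which at least makes the problem visible.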

Upvotes: 0

Sruthi

Reputation: 3018

Just check if your characters are numbers using isnumeric()

string = '24\xa0081\xa0728'
''.join(e for e in string if e.isnumeric())
# '24081728'
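One thing to keep in mind is that filtering on isnumeric() also drops a leading minus sign. A hedged alternative, assuming the cells contain whole numbers grouped by whitespace, is to remove only the whitespace; str.split() with no arguments splits on any Unicode whitespace, including \xa0. The helper name to_int and the sample cells are invented for illustration:

```python
def to_int(cell_text: str) -> int:
    """Parse a whitespace-grouped integer such as '24\\xa0081\\xa0728'."""
    # split() with no arguments treats \xa0 as whitespace, so joining the
    # pieces removes every separator while keeping digits and the sign.
    return int("".join(cell_text.split()))


# Hypothetical cell texts, standing in for [td.text for td in row.find_all('td')]
cells = ["24\xa0081\xa0728", "247 759 384", "-1\xa0500"]
values = [to_int(c) for c in cells]

print(values)
```

Anything that is not a valid integer after the whitespace is removed raises a ValueError instead of being silently mangled.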

Upvotes: 1
