Ahmed Meeran

Reputation: 105

Scraping oddly formatted numbers using BeautifulSoup

I'm trying to scrape an HTML table using BeautifulSoup 4 in Python, but numbers formatted like 247 759 384 (read as 247759384) in the HTML come out differently in Python. I would like to output them as they appear in the table.

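# soup is the BeautifulSoup object for the page (parsed earlier)
temp = []
a = soup.find_all('tr')[1]  # second row of the table
for td in a.find_all('td'):
    temp.append(td.text)  # td.text is already a str
a.find_all('td')[10].text  # eleventh cell of that row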

gives me an output of

'24\xa0081\xa0728'

instead of

24081728

Upvotes: 0

Views: 134

Answers (2)

DaPanda

Reputation: 319

The \xa0 is a non-breaking space (U+00A0) in Unicode.

A quick fix is to replace it with an ASCII space, for example with string.replace(u'\xa0', ' '), or with an empty string if you want the bare digits. But be careful: there may be other encoding issues you haven't spotted yet.
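The replacement described above can be sketched as follows. This is only a minimal illustration using the sample value from the question; the variable names are made up for the example, and it assumes the only separator in the cell is \xa0.

```python
raw = '24\xa0081\xa0728'  # sample cell text from the question

# Swap the non-breaking space for an ordinary ASCII space
spaced = raw.replace('\xa0', ' ')

# Or drop it entirely to get the bare digits, then convert to int
number = int(raw.replace('\xa0', ''))

print(spaced)
print(number)
```

If the page uses other separators as well (for example a narrow no-break space), they would need the same treatment.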

I'd recommend reading this first and handling it in a cleaner way.

Upvotes: 0

Sruthi

Reputation: 3018

Just keep only the characters that are numeric, using isnumeric():

string='24\xa0081\xa0728'
''.join(e for e in string if e.isnumeric())
'24081728'
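A possible way to apply the same filter to several cells at once is sketched below. The helper name cell_to_int and the sample list are hypothetical, not part of the original answer. The sketch uses isdecimal() rather than isnumeric(), because isnumeric() also accepts characters such as '½' that int() cannot parse.

```python
def cell_to_int(text):
    """Hypothetical helper: keep only decimal digits, then convert to int."""
    digits = ''.join(ch for ch in text if ch.isdecimal())
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)

# Sample cell texts with different group separators (made up for illustration)
cells = ['24\xa0081\xa0728', '247 759 384', '1\u202f000']
values = [cell_to_int(c) for c in cells]
print(values)
```

Note that this kind of filter also discards minus signs and decimal points, so it is only suitable for columns of non-negative whole numbers.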

Upvotes: 1
