Reputation: 105
I'm trying to scrape an HTML table using BS4 in Python, but numbers formatted like 247 759 384 (read as 247759384) in the HTML come out differently in Python. I would like to output them as they are in the table.
temp = []
a = soup.find_all('tr')[1]
for td in a.find_all("td"):
    temp.append(str(td.text))
    #print(str(td.text))
a.find_all('td')[10].text
gives me an output of
'24\xa0081\xa0728'
instead of
24081728
Upvotes: 0
Views: 134
Reputation: 319
The \xa0 is a non-breaking space (U+00A0) in Unicode.
A quick way to fix it is to replace it with an ASCII space (using, say, string.replace(u'\xa0', ' ')), or with an empty string if you want the bare digits. But be careful, there may be other encoding issues you haven't spotted yet.
I'd recommend reading this first and then handling it in a cleaner way.
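To make the replacement concrete, here is a minimal sketch, assuming the cell text only contains digits plus whitespace used as a thousands separator. The helper name parse_grouped_int is made up for illustration; it relies on str.split() with no arguments, which splits on any Unicode whitespace, including \xa0.

```python
def parse_grouped_int(text: str) -> int:
    """Turn '24\\xa0081\\xa0728' (or '24 081 728') into the integer 24081728.

    Hypothetical helper: assumes the only non-digit characters are
    whitespace separators (ordinary spaces, no-break spaces, etc.).
    """
    # split() without arguments splits on all Unicode whitespace,
    # so the no-break space U+00A0 is removed along with ASCII spaces.
    return int("".join(text.split()))


cell_text = "24\xa0081\xa0728"

# Replacing with an ASCII space keeps the visual grouping from the table.
print(cell_text.replace("\xa0", " "))  # 24 081 728

# Stripping the separators gives a real number you can compute with.
print(parse_grouped_int(cell_text))  # 24081728
```

If a cell holds anything other than digits and whitespace (a dash for missing data, say), int() will raise ValueError, which is usually what you want to notice rather than silently skip.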
Upvotes: 0
Reputation: 3018
Just keep only the characters that are numeric, using isnumeric():
string = '24\xa0081\xa0728'
''.join(e for e in string if e.isnumeric())
which gives
'24081728'
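One caveat with this filter, shown as a small sketch: isnumeric() is also true for characters such as superscripts and fractions, which int() cannot parse, so isdecimal() may be the safer test. The helper name digits_only and the sample string '5\xa0m²' are made up for illustration. Note too that either filter drops minus signs and decimal points.

```python
def digits_only(text: str) -> str:
    """Keep only decimal digits (hypothetical helper for illustration)."""
    # isdecimal() is stricter than isnumeric(): it rejects characters
    # like '²' or '½' that are "numeric" but not valid in int().
    return "".join(ch for ch in text if ch.isdecimal())


raw = "24\xa0081\xa0728"
print(digits_only(raw))  # 24081728

# With isnumeric(), the superscript two survives the filter.
sample = "5\xa0m²"
print("".join(ch for ch in sample if ch.isnumeric()))  # 5²
print(digits_only(sample))  # 5
```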
Upvotes: 1