Reputation: 7101
I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode characters representing spaces. Is there an efficient way to remove all of them in Python 2.7 and change them into spaces? I guess the more generalized question would be: is there a way to remove Unicode formatting?
I tried using: line = line.replace(u'\xa0',' ')
, as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8')
, but just doing .encode('utf-8')
without replace()
seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
Upvotes: 385
Views: 538045
Reputation: 3145
After trying several methods, to summarize: this is how I did it. The following are two ways of avoiding/removing \xa0 characters from a parsed HTML string.
Assume we have our raw html as following:
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
So let's try to clean this HTML string:
from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print text_string
#u'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'
The above code produces these characters \xa0 in the string. To remove them properly, we can use two ways.
Method # 1 (Recommended):
The first one is BeautifulSoup's get_text method with the strip argument set to True, so our code becomes:
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print clean_text
# Dear Parent,This is a test message,kindly ignore it.Thanks
Method # 2:
The other option is to use Python's unicodedata library, specifically unicodedata.normalize:
import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print clean_text
# u'Dear Parent, This is a test message, kindly ignore it. Thanks'
I have also detailed these methods on my blog, which you may want to refer to.
Upvotes: 46
Reputation: 33
I was facing the same issue; this worked well for me.
df = df.replace(u'\xa0', u'', regex=True)
All instances of \xa0
get replaced.
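A minimal sketch of this approach, assuming a small made-up DataFrame with one text column (Python 3, where the u'' prefix is optional):

```python
import pandas as pd

# a hypothetical DataFrame whose cells contain non-breaking spaces
df = pd.DataFrame({'msg': ['Dear\xa0Parent', 'kindly\xa0ignore']})

# regex=True makes replace() operate on substrings inside every cell
df = df.replace('\xa0', ' ', regex=True)
print(df['msg'].tolist())  # ['Dear Parent', 'kindly ignore']
```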
Upvotes: 2
Reputation: 13265
In Python, \xa0
is a character escape sequence that represents a non-breaking space.
A non-breaking space is a space character that prevents line breaks and word wrapping between two words separated by it.
You can get rid of them by running replace
on a string which contains them:
my_string.replace('\xa0', '') # no more xa0
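For instance, on a made-up string (Python 3, where string literals are Unicode by default):

```python
# '\xa0' is U+00A0, the non-breaking space
my_string = '3\xa0rooms,\xa02\xa0baths'
print(my_string.replace('\xa0', ' '))  # 3 rooms, 2 baths
```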
Upvotes: 12
Reputation: 1301
This is how I solved the issue when I encountered \xa0 in an HTML-encoded string.
I discovered that a non-breaking space is inserted to ensure that a word and the subsequent HTML markup are not separated when a page is resized.
This presents a problem for the parsing code, as it introduces codec-encoding issues. What made it hard is that we are not privy to the encoding used. On Windows machines it can be Latin-1 or CP1252 (Western ISO), but more recent OSes have standardized on UTF-8. By normalizing the Unicode data, we strip the \xa0:
import unicodedata
my_string = unicodedata.normalize('NFKD', my_string).encode('ASCII', 'ignore')
Upvotes: 1
Reputation: 27383
\xa0 is actually a non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.
string = string.replace(u'\xa0', u' ')
When you call .encode('utf-8'), it encodes the Unicode string to UTF-8, which means every code point can be represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
Read up on http://docs.python.org/howto/unicode.html.
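The two behaviours side by side, as a quick sketch (Python 3 syntax; the byte values are the same in Python 2):

```python
s = 'Dear\xa0Parent'

# replacing first, then encoding, yields a plain ASCII space
print(s.replace('\xa0', ' ').encode('utf-8'))  # b'Dear Parent'

# encoding alone keeps the character: U+00A0 becomes the two bytes 0xC2 0xA0
print(s.encode('utf-8'))  # b'Dear\xc2\xa0Parent'
```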
Please note: this answer is from 2012. Python has moved on; you should be able to use unicodedata.normalize now.
Upvotes: 436
Reputation: 1844
Python recognizes it as a space character, so you can split without arguments and join with a normal whitespace:
line = ' '.join(line.split())
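A quick sketch of this on a made-up string with mixed ordinary and non-breaking spaces (Python 3, where str.split() treats \xa0 as whitespace):

```python
line = 'Dear\xa0Parent,  this is\xa0a test\n'
# split() with no arguments splits on any whitespace, including \xa0,
# and discards empty fields; join() rebuilds the string with single spaces
print(' '.join(line.split()))  # Dear Parent, this is a test
```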
Upvotes: 17
Reputation: 7
Generic version with a regular expression, intended to remove all control characters. Note that the pattern has to match the characters themselves; r'\\x..' would only match the literal text "\x.." inside an already-escaped string:
import re
def remove_control_chars(s):
    # strip C0/C1 control characters and the non-breaking space
    return re.sub(r'[\x00-\x1f\x7f-\x9f\xa0]', '', s)
Upvotes: 1
Reputation: 429
Try this code (Python 2; note that it deletes every non-ASCII character rather than replacing it with a space):
import re
re.sub(r'[^\x00-\x7F]+','','paste your string here').decode('utf-8','ignore').strip()
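In Python 3 the .decode() step goes away, since re.sub already returns a str. A sketch on a made-up string; notice that deleting (rather than replacing) non-ASCII characters can glue adjacent words together:

```python
import re

s = 'Dear\xa0Parent, caf\xe9'
# drop every non-ASCII character entirely
print(re.sub(r'[^\x00-\x7F]+', '', s))  # DearParent, caf
```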
Upvotes: 13
Reputation: 3488
There are many useful things in Python's unicodedata library. One of them is the .normalize() function.
Try:
import unicodedata
new_str = unicodedata.normalize("NFKD", unicode_str)
Replace NFKD with any of the other normalization forms (NFC, NFD, NFKC) if you don't get the results you're after.
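A small sketch of what NFKD does to a non-breaking space (Python 3):

```python
import unicodedata

unicode_str = 'Dear\xa0Parent'
# NFKD maps U+00A0 to its compatibility decomposition, a plain space
new_str = unicodedata.normalize('NFKD', unicode_str)
print(new_str)            # Dear Parent
print('\xa0' in new_str)  # False
```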
Upvotes: 338
Reputation: 537
Try using .strip() at the end of your line:
line.strip()
worked well for me. (Note that it only removes whitespace, including \xa0, from the start and end of the string, not from the middle.)
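A sketch of that limitation on a made-up string (Python 3, where str.strip() treats \xa0 as whitespace):

```python
line = '\xa0hello\xa0world\xa0'
# leading and trailing \xa0 are removed; the interior one is not
print(repr(line.strip()))  # 'hello\xa0world'
```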
Upvotes: 32
Reputation: 23361
I ended up here while googling for a problem with a non-printable character. I use MySQL's utf8_general_ci collation and deal with the Polish language. For problematic strings I had to proceed as follows:
text = text.replace('\xc2\xa0', ' ')
It is just a fast workaround; you should probably try something with the right encoding setup.
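The same workaround in Python 3 terms, where it has to be done on bytes (the two bytes \xc2\xa0 are the UTF-8 encoding of U+00A0); a sketch:

```python
raw = b'3\xc2\xa0rooms'  # UTF-8 bytes containing a non-breaking space
fixed = raw.replace(b'\xc2\xa0', b' ')
print(fixed.decode('utf-8'))  # 3 rooms
```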
Upvotes: 9
Reputation: 71
In Beautiful Soup, you can pass get_text() the strip parameter, which strips whitespace from the beginning and end of the text. This will remove \xa0, or any other whitespace, if it occurs at the start or end of the string. In effect, Beautiful Soup replaced the \xa0 with an empty string, and this solved the problem for me.
mytext = soup.get_text(strip=True)
Upvotes: 7
Reputation:
I ran into this same problem pulling some data from a sqlite3 database with python. The above answers didn't work for me (not sure why), but this did: line = line.decode('ascii', 'ignore')
However, my goal was deleting the \xa0s, rather than replacing them with spaces.
I got this from this super-helpful unicode tutorial by Ned Batchelder.
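That is Python 2 code; in Python 3, str no longer has a decode method, so the equivalent operates on bytes. A sketch:

```python
# decoding as ASCII with errors='ignore' silently drops every non-ASCII byte
raw = b'Dear\xc2\xa0Parent'
print(raw.decode('ascii', 'ignore'))  # DearParent
```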
Upvotes: 15
Reputation: 6213
U+00A0 is encoded as the two bytes 0xC2 0xA0 in UTF-8. .encode('utf8') will simply take your Unicode U+00A0 and produce UTF-8's 0xC2 0xA0, hence the apparition of the 0xC2s. Encoding is not replacing, as you've probably realized by now.
Upvotes: 4