PyNEwbie
PyNEwbie

Reputation: 4950

How to work with unicode in Python

I am trying to clean all of the HTML out of a string so the final output is a text file. I have some some research on the various 'converters' and am starting to lean towards creating my own dictionary for the entities and symbols and running a replace on the string. I am considering this because I want to automate the process and there is a lot of variability in the quality of the underlying html. To begin comparing the speed of my solution and one of the alternatives for example pyparsing I decided to test replace of \xa0 using the string method replace. I get a

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

The actual line of code was

s=unicodestring.replace('\xa0','')

Anyway-I decided that I needed to preface it with an r so I ran this line of code:

s=unicodestring.replace(r'\xa0','')

It runs without error but I when I look at a slice of s I see that the \xaO is still there

Upvotes: 15

Views: 19441

Answers (6)

Tejas Tank
Tejas Tank

Reputation: 1216

Instead of this, it's better to use standard python features.

For example:

string = unicode('Hello, \xa0World', 'utf-8', 'replace')

or

string = unicode('Hello, \xa0World', 'utf-8', 'ignore')

where replace will replace \xa0 to \\xa0.

But if \xa0 is really not meaningful for you and you want to remove it then use ignore.

Upvotes: 2

dbr
dbr

Reputation: 169703

s=unicodestring.replace('\xa0','')

..is trying to create the unicode character \xa0, which is not valid in an ASCII sctring (the default string type in Python until version 3.x)

The reason r'\xa0' did not error is because in a raw string, escape sequences have no effect. Rather than trying to encode \xa0 into the unicode character, it saw the string as a "literal backslash", "literal x" and so on..

The following are the same:

>>> r'\xa0'
'\\xa0'
>>> '\\xa0'
'\\xa0'

This is something resolved in Python v3, as the default string type is unicode, so you can just do..

>>> '\xa0'
'\xa0'

I am trying to clean all of the HTML out of a string so the final output is a text file

I would strongly recommend BeautifulSoup for this. Writing an HTML cleaning tool is difficult (given how horrible most HTML is), and BeautifulSoup does a great job at both parsing HTML, and dealing with Unicode..

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup("<html><body><h1>Hi</h1></body></html>")
>>> print soup.prettify()
<html>
 <body>
  <h1>
   Hi
  </h1>
 </body>
</html>

Upvotes: 6

z33m
z33m

Reputation: 6053

may be you should be doing

s=unicodestring.replace(u'\xa0',u'')

Upvotes: 25

Jason Coon
Jason Coon

Reputation: 18441

You can convert it to unicode in this way:

print u'Hello, \xa0World'  # print Hello,  World

Upvotes: 0

&#211;lafur Waage
&#211;lafur Waage

Reputation: 70021

Just a note regarding HTML cleaning. It is very very hard, since

<
body
>

Is a valid way to write HTML. Just an fyi.

Upvotes: 1

Wayne Koorts
Wayne Koorts

Reputation: 11122

Look at the codecs standard library, specifically the encode and decode methods provided in the Codec base class.

There's also a good article here that puts it all together.

Upvotes: 3

Related Questions