jasonmclose

Reputation: 1695

Conversion of Unicode string to ASCII in python 2.7

I have an interesting problem.

I am getting a Unicode string passed to a variable, and I want to convert it to a normal ASCII string.

I can't figure out how to do this in Python 2.7.

The following works in Python 3:

rawdata = '\u003c!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"\u003e'
b = bytearray()
b.extend(map(ord, rawdata))
c = ''.join(chr(i) for i in b)

If I call print(c), I get clean output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

But when I run this in Python 2.7, it still prints the Unicode escape sequences (essentially printing the rawdata variable again).
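The difference most likely comes from how each version parses the literal: Python 3 resolves \u003c inside a plain '' literal while reading the source, whereas Python 2.7 treats a plain '' literal as a byte string and leaves the backslash sequence untouched. A minimal sketch of that distinction, meant to run on either version (the names as_text and as_bytes are made up for illustration):

```python
# u'' literal: the parser resolves \u003c into the single character '<'.
as_text = u'\u003c'

# Bytes literal with an escaped backslash: the same six characters
# (\, u, 0, 0, 3, c) that a plain '' literal holds in Python 2.7.
as_bytes = b'\\u003c'

print(len(as_text))   # one character
print(len(as_bytes))  # six bytes
```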

What am I doing wrong? There has got to be a simple call that I'm not making.

Upvotes: 1

Views: 3978

Answers (2)

Manoel Vilela

Reputation: 904

For portability across both versions, consider Unidecode, a third-party package that transliterates Unicode text into its closest ASCII representation:

>>> from unidecode import unidecode
>>> unidecode(u'ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode(u'30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode(u"\u5317\u4EB0")
'Bei Jing '

Upvotes: 0

jasonmclose

Reputation: 1695

I found the answer two minutes after posting this.

In Python 2.7, decode the string with the raw_unicode_escape codec:

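# In Python 2.7 a plain '' literal is a byte string, so \u003c stays as six literal characters.
rawdata = '\u003c!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"\u003e'
# Decoding turns each \uXXXX sequence into the character it names; the result is a unicode object.
asciistr = rawdata.decode("raw_unicode_escape")
print asciistr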
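If the same code has to run on both 2.7 and 3, one possible variation is to keep the input as bytes and encode back to ASCII explicitly. This is only a sketch: the helper name escapes_to_ascii is made up here, and it assumes the input contains nothing but ASCII bytes and \uXXXX sequences.

```python
def escapes_to_ascii(raw):
    # raw: bytes holding literal backslash-u sequences (a plain '' string in 2.7).
    # Step 1: resolve the escape sequences into a text string.
    text = raw.decode("raw_unicode_escape")
    # Step 2: encode to ASCII; raises UnicodeEncodeError if any
    # resolved character falls outside the ASCII range.
    return text.encode("ascii")

# Shortened stand-in for the DOCTYPE string in the question.
rawdata = b'\\u003c!DOCTYPE html\\u003e'
result = escapes_to_ascii(rawdata)
print(result.decode("ascii"))
```

The explicit encode step means a non-ASCII character fails loudly instead of slipping through, which may or may not be what you want.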

Upvotes: 1
