user1353035

Reputation: 51

Why won't Python display this text correctly? (UTF-8 Decoding Issue)

import urllib.request as u

zipcode = str(47401)
url = 'http://watchdog.net/us/?zip=' + zipcode
con = u.urlopen(url)

page = str(con.read())
value3 = int(page.find("<title>")) + 7
value4 = int(page.find("</title>")) - 15
district = str(page[value3:value4])
print(district)
newdistrict = district.replace("\xe2\x80\x99","'")
print(newdistrict)

For some reason, my code is pulling in the title in the following format: IN-09: Indiana\xe2\x80\x99s 9th. I know the \xe2\x80\x99 sequence is the UTF-8 encoding of the ’ character, but I can't figure out how to get Python to replace that set of characters with the ' symbol. I've tried decoding the string, but it's already Unicode, and the replace code above doesn't change anything. Any advice as to what I'm doing incorrectly?

Upvotes: 4

Views: 20417

Answers (2)

jojo

Reputation: 3609

Try this:

newdistrict = district.encode("**THE_INPUT_STRING_ENCODING**").replace("\\xe2\\x80\\x99","'")

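I think you are using UTF-8, so it should look like this (note that in Python 3, encode() returns bytes and bytes.replace() requires bytes arguments, so this form only works as written in Python 2):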

newdistrict = district.encode("utf-8").replace("\\xe2\\x80\\x99","'")

But this isn't the correct way to work with Unicode. Once your text is imported into the program, you should work in Unicode everywhere, except perhaps at output, where the encoding should suit the external destination.

So a better way is to add this line at the top of your script:

# -*- coding: utf-8 -*-

Read your input as UTF-8:

page = con.read().decode('utf-8')

and then do newdistrict = district.replace(u"YOUR_UNICODE_STRING","'")

For example:

newdistrict = district.replace(u"דכעדחלגעדיל","'")

For more help, read this:

http://docs.python.org/howto/unicode.html
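
As a rough illustration of the "decode on input, work in Unicode, encode on output" idea in Python 3, here is a minimal sketch. The `raw` bytes are a made-up stand-in for what `con.read()` might return (not the real page), and `"\u2019"` is the right single quotation mark that the bytes `\xe2\x80\x99` decode to.

```python
# Hypothetical sample standing in for con.read(); the real page content is assumed.
raw = b"<title>IN-09: Indiana\xe2\x80\x99s 9th</title>"

# Decode once, at the input boundary.
page = raw.decode("utf-8")

# From here on, work with str. U+2019 is the right single quotation mark.
ascii_page = page.replace("\u2019", "'")
print(ascii_page)

# Encode only when the text leaves the program (file, socket, etc.).
out_bytes = ascii_page.encode("utf-8")
```

The replace works here because `page` holds the actual character, not the escaped byte sequence.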

Upvotes: -1

Chris Morgan

Reputation: 90742

When you call con.read(), it returns a bytes object. Calling str() on it without specifying an encoding returns a string of its representation, so you get the escapes rather than the real characters. (That means that your string ends up containing \\xe2\\x80\\x99 as well as other undesired things, such as the leading b'.) bytes is mostly like str in Python 2: it doesn't have any encoding information stored. str in Python 3 is like unicode in Python 2: it holds decoded text. So, when turning a bytes object into a str object, you need to tell Python what encoding the bytes are actually in. In this case, that's utf-8.

Instead of calling str() on the bytes object with an encoding argument, you would be better off using bytes.decode; it does the same thing, just more neatly.

>>> import urllib.request as u
>>> zipcode = 47401
>>> url = 'http://watchdog.net/us/?zip={}'.format(zipcode)
>>> con = u.urlopen(url)
>>> page = con.read().decode('utf-8')
>>> page[page.find("<title>") + 7:page.find("</title>") - 15]
'IN-09: Indiana’s 9th'

The only functional change that has been made here is the specification to decode the bytes object as 'utf-8'.
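
To see the difference without making a network request, the following sketch compares the three conversions on a small sample. The `raw` value is an assumed stand-in for part of the response body, not the actual page.

```python
# Hypothetical sample bytes; the real response from con.read() is assumed.
raw = b"Indiana\xe2\x80\x99s 9th"

as_repr = str(raw)              # representation of the bytes object, escapes and all
as_text = raw.decode("utf-8")   # the real text
same = str(raw, "utf-8")        # equivalent to raw.decode("utf-8")

print(as_repr)                  # contains literal backslashes and the b'...' wrapper
print(as_text == same)

# The escaped form is much longer than the decoded text.
print(len(as_repr), len(as_text))
```

This also shows why the replace in the question had no effect: the string produced by `str(raw)` contains a literal backslash followed by `xe2`, not the character that `"\xe2"` denotes in a Python string literal.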

Upvotes: 6
