David

Reputation: 382

Python: difficulty converting ASCII to Unicode

My goal: get the page source from a URL and count all instances of a keyword within it.

How I am doing it: fetching the page source via urllib2, then looping through each character of the page source and comparing it to the keyword.

My problem: my keyword is encoded in UTF-8 while the page source is in ASCII, and I run into errors whenever I try to convert between them.

Getting the page source:

import urllib2

def getPageSource(myUrl):
    response = urllib2.urlopen(myUrl)
    # read() returns the raw response body as a byte string
    return response.read()

Comparing the page source and the keyword:

pageSource[i] == keyWord[j]

I need to convert one of these strings to the other's encoding. Intuitively, I felt that converting the ASCII page source to UTF-8 (the keyword's encoding) would be the best and easiest, so:

pageSource = unicode(pageSource)
UnicodeDecodeError: 'ascii' codec can't decode byte __ in position __: ordinal not in range(128)

Upvotes: 2

Views: 288

Answers (2)

Alastair McCormack

Reputation: 27704

I'll assume your remote page contains more than just ASCII; otherwise your comparison would already work as is. ASCII is a subset of UTF-8: for example, A is the byte 0x41 in both.
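
To make the subset relationship concrete, here is a small sketch. The sample strings are my own, not taken from your page, and the snippet is written so it should run unchanged on Python 2.7 and Python 3.

```python
# ASCII text encodes to the same bytes in ASCII and in UTF-8.
ascii_text = u"A"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8") == b"\x41"

# A non-ASCII character has no ASCII encoding and takes several bytes in UTF-8.
euro = u"\u20ac"  # the euro sign
euro_utf8 = euro.encode("utf-8")
assert euro_utf8 == b"\xe2\x82\xac"

# Decoding those bytes as ASCII fails with the same kind of error
# as the one in the question.
try:
    euro_utf8.decode("ascii")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

In Python 2, `unicode(pageSource)` with no encoding argument performs exactly this ASCII decode, which is why it breaks on the first byte above 0x7F.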

You may find the Python Requests library easier, as it automatically decodes remote content to Unicode strings based on the server's headers. Unicode strings are encoding-neutral, so they can be compared without worrying about encoding.

import requests

resp = requests.get("http://www.example.com/utf8page.html")
resp.text
>> u'My unicode data €'

You will then need to decode your reference data:

# decode the whole keyword once; strings are immutable, so assigning to keyWord[j] would fail
keyWord = "€".decode("UTF-8")
keyWord
>> u'€'
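
Once both sides are Unicode strings, the character-by-character loop from the question isn't really needed; the built-in `count` method can do the work. A minimal sketch, where `page_text` is a made-up stand-in for `resp.text` and `count_keyword` is a hypothetical helper name of my own (written to run on Python 2.7 and Python 3):

```python
def count_keyword(page_text, keyword_bytes, encoding="utf-8"):
    """Count non-overlapping occurrences of a byte-string keyword in Unicode text."""
    keyword = keyword_bytes.decode(encoding)  # bytes -> Unicode
    return page_text.count(keyword)


# Stand-in for resp.text; \u20ac is the euro sign.
page_text = u"Price: 5 \u20ac, shipping: 2 \u20ac"

# The keyword as UTF-8 bytes, the way a UTF-8 source file would store it.
euro_count = count_keyword(page_text, b"\xe2\x82\xac")
```

Note that `count` only finds non-overlapping matches, which is usually what you want for keywords.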

If you embed non-ASCII characters in your source code, you need to declare the encoding the file is saved in. For example, on the first or second line of your script:

# coding=UTF-8

Upvotes: 0

Martijn Pieters

Reputation: 1121276

When trying to work with text, don't leave your data as byte strings. Decode to Unicode early, encode back to bytes as late as possible.

Decode your downloaded network data:

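import urllib2

def getPageSource(myUrl):
    response = urllib2.urlopen(myUrl)
    # Latin-1 is the default for HTTP text/ responses, adjust as needed
    # getparam() takes only the parameter name and returns None when it is missing
    codec = response.info().getparam('charset') or 'latin1'
    return response.read().decode(codec)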

and do the same for your keyWord data. If it is encoded as UTF-8, decode it as such, or use Unicode string literals.
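
Putting the "decode early, encode late" idea together, a rough sketch of the whole flow might look like this. The byte strings stand in for `response.read()` and your keyword, and `decode_body` is a hypothetical helper name, not part of urllib2. It is written to run on Python 2.7 and Python 3.

```python
def decode_body(raw_body, charset=None, default="latin1"):
    """Decode response bytes, falling back to a default codec when no charset is known."""
    return raw_body.decode(charset or default)


# Stand-ins for response.read() and the charset parameter of the Content-Type header.
raw_body = b"caf\xc3\xa9 au lait, caf\xc3\xa9 noir"
page_source = decode_body(raw_body, "utf-8")

# The keyword arrives as UTF-8 bytes, so it is decoded the same way.
key_word = b"caf\xc3\xa9".decode("utf-8")

# Both values are Unicode now, so counting needs no further conversion.
occurrences = page_source.count(key_word)

# Encode only at the boundary, for example when writing a report to a file.
report = (u"%s: %d" % (key_word, occurrences)).encode("utf-8")
```

Everything between the decode and the final encode works on Unicode strings only, so no implicit ASCII conversion can sneak in.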

You may want to read up on Python and Unicode.

Upvotes: 2
