Reputation: 382
My goal: get the page source from a URL and count all instances of a keyword within that page source.
How I am doing it: getting the page source via urllib2, then looping through each character of the page source and comparing it to the keyword.
My problem: my keyword is encoded in UTF-8 while the page source is in ASCII, and I run into errors whenever I try to convert between them.
Getting the page source:
import urllib2
response = urllib2.urlopen(myUrl)
return response.read()
Comparing page source and keyword:
pageSource[i] == keyWord[j]
I need to convert one of these strings to the other's encoding. Intuitively, converting ASCII (the page source) to UTF-8 (the keyword) seemed the best and easiest option, so:
pageSource = unicode(pageSource)
UnicodeDecodeError: 'ascii' codec can't decode byte __ in position __: ordinal not in range(128)
Upvotes: 2
Views: 288
Reputation: 27704
I'll assume your remote source page contains more than just ASCII; otherwise your comparison would already work as is, because ASCII is a subset of UTF-8. For example, A is 0x41 in both ASCII and UTF-8.
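To illustrate the subset relationship, here is a minimal sketch. The variable names are made up for the example, and the euro sign stands in for whatever non-ASCII character the real page contains. Pure ASCII bytes decode identically under both codecs, while a non-ASCII byte sequence makes the ascii codec fail in the same way as the error in the question.

```python
# Pure ASCII bytes decode to the same text under both codecs.
ascii_bytes = b"A"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")

# A non-ASCII character encoded as UTF-8 (hypothetical page content).
euro_bytes = u"\u20ac".encode("utf-8")

# The ascii codec rejects any byte above 0x7F.
try:
    euro_bytes.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False
```

The snippet avoids Python 2 only syntax, so it should behave the same on Python 2.7 and Python 3.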
You may find the Python Requests library easier, as it automatically decodes remote content to Unicode strings based on the server's headers. Unicode strings are encoding-neutral, so they can be compared without worrying about encoding.
import requests
resp = requests.get("http://www.example.com/utf8page.html")
resp.text
>> u'My unicode data €'
You will then need to decode your reference data:
keyWord = "€".decode("UTF-8")
keyWord
>> u'€'
If you're embedding non-ASCII characters in your source code, you need to declare the encoding you're using. For example, on the first or second line of your script:
# coding=UTF-8
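As a rough illustration of why a character-by-character comparison goes wrong on undecoded data, the sketch below compares the length of a Unicode string with the length of its UTF-8 bytes. The names are illustrative only, and the euro sign is again just a stand-in for the real keyword.

```python
# One character of text...
euro_text = u"\u20ac"
text_length = len(euro_text)

# ...becomes several bytes once encoded as UTF-8, so indexing the raw
# byte string one position at a time would split the character apart.
euro_bytes = euro_text.encode("utf-8")
byte_length = len(euro_bytes)

# Decoding restores the single character, which compares equal again.
round_trip = euro_bytes.decode("utf-8")
```

Once both sides are Unicode strings, an index-by-index loop sees whole characters rather than fragments of them.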
Upvotes: 0
Reputation: 1121276
When trying to work with text, don't leave your data as byte strings. Decode to Unicode early, encode back to bytes as late as possible.
Decode your downloaded network data:
import urllib2
response = urllib2.urlopen(myUrl)
# Latin-1 is the default for HTTP text/* responses, adjust as needed
# getparam() takes only the parameter name and returns None when it is missing
codec = response.info().getparam('charset') or 'latin1'
return response.read().decode(codec)
Do the same for your keyWord data: if it is encoded as UTF-8, decode it as such, or use Unicode string literals.
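Putting the two steps together, a minimal sketch of the counting itself might look like the following. The helper name count_keyword and the sample bytes are hypothetical and not part of urllib2. The helper assumes you already have the raw response bytes and whatever charset the server declared, and it counts non-overlapping matches.

```python
def count_keyword(raw_bytes, keyword, charset=None):
    """Decode raw page bytes, then count non-overlapping keyword matches."""
    # Fall back to Latin-1 when no charset was declared.
    text = raw_bytes.decode(charset or "latin1")
    return text.count(keyword)

# Stand-in for response.read(): UTF-8 bytes containing two euro signs.
page_bytes = u"price: 5 \u20ac, shipping: 2 \u20ac".encode("utf-8")
keyword = u"\u20ac"  # Unicode literal, so no decoding is needed

euro_count = count_keyword(page_bytes, keyword, charset="utf-8")

# With the wrong codec the decode still succeeds, but the keyword is
# never found because the text has turned into mojibake.
wrong_codec_count = count_keyword(page_bytes, keyword)
```

Because both the page text and the keyword are Unicode at the point of comparison, the built-in count method can replace the manual character loop.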
You may want to read up on Python and Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder
Upvotes: 2