I'm extremely confused by Unicode in Python 2.x.
I'm using BeautifulSoup to scrape a webpage, and I'm trying to insert the things I find into a dictionary with the name as the key and the URL as the value.
I'm using BeautifulSoup's find function to get the info I need. My code started out as follows:
name = i.find('a').string
url = i.find('a').get('href')
This works, except that the thing returned from find is an object, not a string.
Here's where things start confusing me.
If I try to convert it to type str before I assign it to the variable, it sometimes throws a UnicodeEncodeError:
'ascii' codec can't encode character u'\xa0' in position 5: ordinal not in range(128)
I Googled around and found that I should be encoding to ASCII, so I tried adding:
print str(i.find('a').string).encode('ascii', 'ignore')
No luck; it still gives a UnicodeEncodeError.
From there, I tried using repr:
print repr(i.find('a').string)
And that works... almost!
I ran into a new problem here.
Once everything is said and done and the dictionary is built, I can't bloody access anything! It keeps giving me a KeyError.
I can loop over the dict:
for i in sorted(data.iterkeys()):
    print i
>>> u'Key1'
>>> u'Key2'
>>> u'Key3'
>>> u'Key4'
but if I try to access an item of the dict like this:
print data['key1']
OR
print data[u'key1']
OR
test = unicode('key1')
print data[test]
They all raise a KeyError, which is 100% confusing to me. I assume it has something to do with the keys being Unicode objects.
I've tried just about everything I can come up with, but I can't figure out what's going on.
Oh! Adding to the oddity, this code:
name = repr(i.find('a').string)
print type(name)
returns
>>> <type 'str'>
but if I just print the thing
print name
it shows it as a unicode string
>>> u'string name'
Reputation: 1122142
The .string value is indeed not a string. You need to cast it to unicode():
name = unicode(i.find('a').string)
It's a unicode-like object called NavigableString. If you really need it to be a str instead, you can encode it from there:
name = unicode(i.find('a').string).encode('utf8')
or similar. For use in a dict, I'd use unicode() objects and not encode.
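One possible explanation for the KeyError in the question, sketched under the assumption that the dictionary keys were built with repr(): repr() returns the text of a literal, quotes included (plus the u prefix on Python 2), so the stored key is not the name itself. The printed keys in the question (u'Key1' with visible quotes) are consistent with that. Lookups are also case-sensitive, so 'key1' would not match 'Key1' in any case. The name and URL below are hypothetical.

```python
name = u'Key1'  # hypothetical link text, standing in for unicode(tag.string)

# Key built with repr(): the quote characters become part of the key.
bad = {repr(name): 'http://example.com/1'}

# Key built from the text itself.
good = {name: 'http://example.com/1'}

print(name in bad)       # the plain name is not a key in this dict
print(name in good)      # the plain name is a key here
print(u'key1' in good)   # lookups are case-sensitive
```

Keeping the text itself as the key means a lookup such as good[u'Key1'] works without any further conversion.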
To understand the difference between unicode() and str() and what encoding to use, I recommend you read the Python Unicode HOWTO.
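As a rough illustration of why the choice of encoding matters: in Python 2, calling str() on unicode text implicitly encodes it with the ascii codec, which cannot represent u'\xa0' (a non-breaking space), and that is the error quoted in the question. The sketch below performs the encode step explicitly so it behaves the same on Python 2 and Python 3; the sample text is made up.

```python
# Hypothetical sample text containing a non-breaking space (u'\xa0').
text = u'Some\xa0name'

# Encoding to ASCII fails; this is what str() attempts implicitly on Python 2.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# UTF-8 can represent every code point, so this succeeds.
utf8_bytes = text.encode('utf8')

# 'ignore' silently drops the characters ASCII cannot represent.
stripped = text.encode('ascii', 'ignore')

print(ascii_failed)
print(repr(utf8_bytes))
print(repr(stripped))
```

Note that 'ignore' loses data without warning, which is one reason to stay with unicode objects and only encode at the point of output.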