Reputation: 21720
I have some text coming from the web as such:
Â£6.49
Obviously I would like this to be displayed as:
£6.49
I have tried the following so far:
s = url['title']
s = s.encode('utf8')
s = s.replace(u'Â','')
And a few variants on this (after finding it on this very same forum), but still no luck, as I keep getting:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 100: ordinal not in range(128)
Could anyone help me get this right?
UPDATE:
Adding the repr examples and content type
u'Star Trek XI &#xA3;3.99'
u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'
Content-Type: text/html; charset=utf-8
Thanks in advance.
Upvotes: 3
Views: 8735
Reputation: 879591
If s = url['title'] makes s equal to this:
In [48]: s=u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'
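then the problem lies upstream of this line: either url is being built incorrectly (Case 1), or the content coming from the web is itself malformed (Case 2).
In Case 1, we'd need to see the code that defines url.
In Case 2, a quick-and-dirty workaround would be to encode the unicode object s with the raw-unicode-escape codec: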
In [49]: print(s)
Oscar Winners Best Pictures Box Set Â£6.49
In [50]: print(s.encode('raw-unicode-escape'))
Oscar Winners Best Pictures Box Set £6.49
See also this SO question.
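For readers on Python 3, where printing the encoded value would show a bytes repr rather than the text, the same workaround can be sketched as an encode/decode round trip. This is only a sketch: it assumes the title really is UTF-8 data that was decoded one byte per character, and the variable names raw and fixed are illustrative.

```python
# Assumption: s holds UTF-8 bytes that were wrongly decoded one byte per character.
s = 'Oscar Winners Best Pictures Box Set \xc2\xa36.49'

# raw-unicode-escape writes every code point below 256 back out as a single byte.
raw = s.encode('raw-unicode-escape')

# Those bytes can then be decoded with the charset the page declared (utf-8).
fixed = raw.decode('utf-8')

print(fixed)
```

For strings that contain only code points below 256, encoding with 'latin-1' produces the same bytes, and it raises an error on anything else instead of emitting a literal \uXXXX escape.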
Regarding titles like s=u'Star Trek XI &#xA3;3.99': again, it would be nice to fix the problem before it gets to this stage, perhaps by looking at how url is defined. But assuming the content from the web is malformed, a workaround would be:
In [86]: import re
In [87]: print(re.sub(r'&#x([a-fA-F\d]+);',lambda m: unichr(int(m.group(1),base=16)),s))
Star Trek XI £3.99
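The regular expression above only handles hexadecimal character references. On Python 3, a possible alternative is the standard library's html.unescape, which also covers decimal and named references. A minimal sketch, assuming the title arrives as a str that still contains the reference:

```python
import html

# Assumption: the title still contains an undecoded HTML character reference.
s = 'Star Trek XI &#xA3;3.99'

# html.unescape converts numeric and named character references to characters.
decoded = html.unescape(s)

print(decoded)
```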
A little bit of explanation:
Note that
In [51]: x=u'£'
In [53]: x.encode('utf-8')
Out[53]: '\xc2\xa3'
So the unicode object u'£', encoded with the utf-8 codec, becomes the string object '\xc2\xa3'.
Somehow, url['title'] is getting defined to be the unicode object u'\xc2\xa3'. (The u makes a big difference!)
Thus we have u'\xc2\xa3' when we desire '\xc2\xa3'.
Encoding the unicode object u'\xc2\xa3' with the raw-unicode-escape codec transforms it to '\xc2\xa3'.
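One plausible way to end up in this state is for UTF-8 bytes to be decoded with a single-byte codec such as latin-1 somewhere upstream. The thread does not confirm that this is what happened here; the Python 3 sketch below only reproduces the symptom.

```python
pound = '£'

# Encoding to UTF-8 yields two bytes for this one character.
utf8_bytes = pound.encode('utf-8')

# Assumption: something upstream decoded those bytes as latin-1,
# turning each byte into its own character.
misread = utf8_bytes.decode('latin-1')

print(repr(utf8_bytes), repr(misread), len(misread))
```

The misread value is two characters long (Â followed by £), which matches the \xc2\xa3 pair in the posted repr.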
Upvotes: 7
Reputation: 9331
Edit: your objects are already unicode. It seems to me there is no reason to use encode/decode at all.
>>> print u'Oscar Winners Best Pictures Box Set \xc2\xa36.49'.replace(u'Â','')
Oscar Winners Best Pictures Box Set £6.49
However, it seems to me that something is wrong there. The unicode objects do not actually hold properly decoded text; they look like UTF-8 bytes stored in a unicode object. See:
>>> print 'Oscar Winners Best Pictures Box Set \xc2\xa36.49'.decode('utf8')
Oscar Winners Best Pictures Box Set £6.49
The repr() you posted should not be that of a unicode object. That's why I was asking where you are getting the data from; something is wrong there.
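To illustrate why stripping the stray character only hides the symptom, here is a small Python 3 comparison. The sample string with "Café" and the helper name repair are made up for illustration, and the helper assumes the mix-up was UTF-8 read as latin-1.

```python
def repair(text):
    """Undo a UTF-8-read-as-latin-1 mix-up; return text unchanged if that does not apply."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

# Hypothetical title containing two mis-decoded characters (é and £).
garbled = 'Caf\xc3\xa9 special \xc2\xa36.49'

# Removing the stray \xc2 fixes the pound sign but leaves the é broken.
stripped = garbled.replace('\xc2', '')

# Reversing the wrong decode fixes both.
repaired = repair(garbled)

print(stripped)
print(repaired)
```

Text that was decoded correctly in the first place generally fails the round trip and is returned as-is, so the helper is reasonably safe to apply to mixed input, but fixing the decode at the source remains the better option.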
Upvotes: 0