Reputation: 2250
I am trying to reduce a string to just English characters, digits and punctuation, but I am running into an encoding/decoding error.
The original string is: "DD-XBS 2 1/2x 17 LCLξ 3-pack"
The code I wrote to tackle this issue is:
    try:
        each = str(each.decode('ascii'))
    except UnicodeDecodeError:
        each = str(each.decode('utf-8').encode('ascii', errors='ignore'))
but I am getting an error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8c in position 16: invalid start byte
How can I solve this?
Upvotes: 0
Views: 548
Reputation: 886
Judging by your question, I assume you are using Python 2.7.
The error occurs because your byte string contains the byte '\x8c', which the UTF-8 codec cannot decode at that position.
To understand this better, look at the following:
>>> u = '\x8c'.decode('cp1252')
>>> u
u'\u0152'
So, when we decode the '\x8c' byte with cp1252, we get a Unicode code point whose name is:
>>> import unicodedata
>>> unicodedata.name(u)
'LATIN CAPITAL LIGATURE OE'
However, if we try to decode with UTF-8, we'll get an error:
>>> u = '\x8c'.decode('utf-8')
...
UnicodeDecodeError: 'utf8' codec can't decode byte 0x8c ...
So, the '\x8c' byte on its own is not valid UTF-8.
To fix the problem, you can try this (it assumes the data really is cp1252-encoded):
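As a minimal sketch of why that byte is rejected: UTF-8 reserves the bit pattern 10xxxxxx for continuation bytes, and 0x8c matches it, so it can never start a character. The variable names below are only illustrative, and the snippet uses `b''` literals so that it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import unicodedata

raw = b'\x8c'

# UTF-8 continuation bytes have the bit pattern 10xxxxxx, so they can
# never begin a character. bytearray gives integers on Python 2 and 3.
first = bytearray(raw)[0]
is_continuation = (first & 0xC0) == 0x80
print(is_continuation)          # True

# cp1252 maps the same single byte to U+0152.
ch = raw.decode('cp1252')
print(unicodedata.name(ch))     # LATIN CAPITAL LIGATURE OE

# Encoded as UTF-8, that character takes two bytes, and neither is 0x8c.
utf8_bytes = ch.encode('utf-8')
print(repr(utf8_bytes))
```

This is why the same byte decodes cleanly under cp1252 but raises "invalid start byte" under UTF-8.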
each = str(each.decode('cp1252').encode('ascii', errors='ignore'))
Or this:
each = str(each.decode('utf-8', errors='ignore').encode('ascii', errors='ignore'))
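The two one-liners above can be combined into a small helper. This is only a sketch: the function name `to_ascii`, the `fallback` parameter and the sample bytes are hypothetical, and cp1252 as the fallback encoding is an assumption about where the data came from. It tries UTF-8 first and falls back only if that fails, then drops every non-ASCII character.

```python
# -*- coding: utf-8 -*-

def to_ascii(raw, fallback='cp1252'):
    """Reduce a byte string to ASCII-only text, dropping everything else."""
    try:
        text = raw.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: data that is not valid UTF-8 is in the fallback
        # encoding; bytes undefined in that encoding are skipped.
        text = raw.decode(fallback, 'ignore')
    # Drop every character outside ASCII, then return text again.
    return text.encode('ascii', 'ignore').decode('ascii')

# Hypothetical sample containing the problematic 0x8c byte.
sample = b'DD-XBS 2 1/2x 17 LCL\x8c 3-pack'
print(to_ascii(sample))  # DD-XBS 2 1/2x 17 LCL 3-pack
```

Note that on Python 2.7 the function returns a `unicode` object; wrap the result in `str()` if you need a byte string, which is safe here because only ASCII characters remain.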
In your case you can also filter the characters with ord():
    my_str = 'DD-XBS 2 1/2x 17 LCLξ 3-pack'
    ascii_str = ''
    for sign in my_str:
        if ord(sign) < 128:
            ascii_str += sign
    print(ascii_str)  # DD-XBS 2 1/2x 17 LCL 3-pack
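The loop relies on Python 2 yielding one-character strings when iterating over a byte string. A version-independent sketch of the same filtering idea, using `bytearray` (which yields integers on both Python 2.7 and Python 3), could look like this. The function name `strip_non_ascii` is hypothetical.

```python
# -*- coding: utf-8 -*-

def strip_non_ascii(raw):
    """Keep only the bytes below 128 from a byte string."""
    # bytearray iterates as integers on both Python 2 and Python 3,
    # so no ord() call is needed.
    return bytes(bytearray(b for b in bytearray(raw) if b < 128))

data = u'DD-XBS 2 1/2x 17 LCL\u03be 3-pack'.encode('utf-8')
cleaned = strip_non_ascii(data)
print(cleaned.decode('ascii'))  # DD-XBS 2 1/2x 17 LCL 3-pack
```

Because it works on raw bytes, this never raises a decode error, whatever encoding the input happens to be in.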
But possibly the best solution is just to convert your source to UTF-8.
Upvotes: 2