Reputation: 1
I am creating a dictionary that requires each letter of a string to be separated by whitespace, and I am using join to do it. The problem is when the string contains non-ASCII characters: join breaks each of them into two bytes and the result is garbage.
Example:
>>> word = 'məsjø'
>>> ' '.join(word)
Gives me:
'm \xc9 \x99 s j \xc3 \xb8'
When what I want is:
'm ə s j ø'
Or even:
'm \xc9\x99 s j \xc3\xb8'
Upvotes: 0
Views: 1767
Reputation: 5580
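You should use unicode strings. A plain Python 2 str is a sequence of bytes, so join iterates over the individual UTF-8 bytes, whereas a unicode string is a sequence of characters, i.e.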
word = u'məsjø'
And don't forget to declare the encoding of your Python source file by putting this on its first or second line:
# -*- coding: UTF-8 -*-
(Don't even think about using something other than UTF-8. ;))
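If the string does not come from a literal but arrives as raw bytes (say, read from a file), the same idea applies: decode first, join, and encode again only if you need bytes. A minimal sketch, assuming the input is valid UTF-8; the variable names are made up for illustration:

```python
# -*- coding: UTF-8 -*-
# Assumption: the input arrives as UTF-8 encoded bytes (a plain str in Python 2).
raw = b'm\xc9\x99sj\xc3\xb8'

# Decode first so that each element is a character, not a byte.
text = raw.decode('utf-8')

# Joining a unicode string puts one space between characters.
spaced = u' '.join(text)

# Encode again only when bytes are needed (e.g. when writing to a file).
encoded = spaced.encode('utf-8')
```

Here encoded holds the bytes from the "Or even" variant in the question, and spaced is the unicode version of the first one.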
Update: This only applies to Python < 3. If you're using Python >= 3, you would probably not have run into these problems in the first place, because str is Unicode text by default there. So if upgrading to 3.x is an option, it's the way to go. Unfortunately, it might not be in some cases because of library dependencies etc.
As mentioned in the comments, encoding issues might also result from a differently configured terminal, although that was not the problem here, apparently.
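To rule out the terminal, you can check which encoding Python thinks stdout uses. A rough sketch; describe_stdout_encoding is a hypothetical helper name, not something from the standard library, and sys.stdout.encoding can be None when output is piped:

```python
import sys

def describe_stdout_encoding(stream=sys.stdout):
    """Return the encoding the stream reports, or 'unknown' if it reports none."""
    # Hypothetical helper: falls back to 'unknown' when the attribute
    # is missing or None (e.g. output piped under Python 2).
    return getattr(stream, 'encoding', None) or 'unknown'

print(describe_stdout_encoding())
```

If this prints something other than UTF-8, garbled output may come from the terminal settings rather than from the string handling.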
Upvotes: 3