Reputation: 818
I want to do a simple replace like:
line= line.replace ('ʃ',' sh ')
line= line.replace ('ɐ͂',' an ')
line= line.replace ('ẽ',' en ')
The problem is that Python does not accept these characters.
I also tried things like:
line= line.replace (u'\u0283',' sh ')
but I still get a decoding error when opening the file: UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
I messed around with codecs, but I couldn't find anything suitable. Maybe I am going down the wrong path?
Upvotes: 1
Views: 2260
Reputation: 177951
You can use non-ASCII characters in Python, but you have to tell Python the encoding of your source file with a #coding statement. Make sure to save the source file in the declared encoding. It is also good practice to do all text processing in Unicode:
#!python2
#coding:utf8
line = u'This is a ʃɐ͂ẽ test'
line = line.replace(u'ʃ', u' sh ')
line = line.replace(u'ɐ͂', u' an ')
line = line.replace(u'ẽ', u' en ')
print line
Output:
This is a sh an en test
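This is likely what produced the error in the question: if the text stays a byte string and is mixed with Unicode replacement strings, Python 2 tries to decode the bytes with the ASCII codec and fails. A rough sketch of that failure mode:
#coding:utf8
line = 'ʃ test'              # byte string: the UTF-8 bytes '\xca\x83 test'
line.replace(u'ʃ', u' sh ')  # raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xca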
Note that ɐ͂ is actually two Unicode codepoints: ɐ (U+0250) followed by the combining codepoint U+0342 COMBINING GREEK PERISPOMENI. The ẽ can be represented either as the single codepoint U+1EBD LATIN SMALL LETTER E WITH TILDE, or as the two codepoints U+0065 LATIN SMALL LETTER E and U+0303 COMBINING TILDE. To make sure your text consistently uses either composed codepoints or decomposed sequences, use the unicodedata module:
import unicodedata as ud
# Pick one form and use it for both the text and the replacement strings:
line = ud.normalize('NFC', line)  # composed: single codepoints where possible
line = ud.normalize('NFD', line)  # decomposed: base character + combining codepoint
There is also NFKD and NFKC. See the Unicode standard for details on which is best for you.
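As a quick sketch of why the form matters for the replace, a decomposed ẽ will not match the precomposed codepoint until it is normalized:
#coding:utf8
import unicodedata as ud

s = u'e\u0303'                                            # decomposed: U+0065 + U+0303 COMBINING TILDE
print len(s)                                              # 2
print len(ud.normalize('NFC', s))                         # 1 -> U+1EBD
print s.replace(u'\u1ebd', u' en ')                       # unchanged: the precomposed form is not there
print ud.normalize('NFC', s).replace(u'\u1ebd', u' en ')  # ' en '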
If you are reading from a file, use io.open and specify the encoding of the file to automatically convert the input to Unicode:
import io

with io.open('data.txt', 'r', encoding='utf8') as f:
    for line in f:
        # do something with each Unicode line here
        pass
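Putting the pieces together, here is a minimal sketch (the file names data.txt and out.txt are just placeholders) that reads a UTF-8 file, normalizes each line, applies the replacements, and writes the result back out as UTF-8:
#!python2
#coding:utf8
import io
import unicodedata as ud

with io.open('data.txt', 'r', encoding='utf8') as fin, \
     io.open('out.txt', 'w', encoding='utf8') as fout:
    for line in fin:
        line = ud.normalize('NFC', line)  # one consistent form before replacing
        line = line.replace(u'ʃ', u' sh ')
        line = line.replace(u'ɐ͂', u' an ')
        line = line.replace(u'ẽ', u' en ')
        fout.write(line)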
Upvotes: 2