badner

Reputation: 818

IPA to Arpabet python

I want to do a simple replace like:

line = line.replace('ʃ', ' sh ')
line = line.replace('ɐ͂', ' an ')
line = line.replace('ẽ', ' en ')

The problem is that python does not accept these characters.

I also tried things like:

line = line.replace(u'\u0283', ' sh ')

but I still can't open anything because I get a decoding error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

I messed around with codecs but couldn't find anything suitable. Maybe I am going down the wrong path?

Upvotes: 1

Views: 2260

Answers (1)

Mark Tolonen

Reputation: 177951

You can use non-ASCII characters in Python 2 source code, but you have to tell Python the encoding of your source file with a #coding declaration, and make sure to save the source in the encoding you declare. It is also good practice to do all text processing in Unicode:

#!python2
#coding:utf8
line = u'This is a ʃɐ͂ẽ test'
line = line.replace(u'ʃ', u' sh ')
line = line.replace(u'ɐ͂', u' an ')
line = line.replace(u'ẽ', u' en ')
print line

Output:

This is a  sh  an  en  test

Note that ɐ͂ is actually two Unicode codepoints: ɐ (U+0250) followed by the combining codepoint U+0342 COMBINING GREEK PERISPOMENI. The ẽ can be represented either as the single codepoint U+1EBD LATIN SMALL LETTER E WITH TILDE, or as two codepoints, U+0065 LATIN SMALL LETTER E and U+0303 COMBINING TILDE. To make sure you are consistently working with composed or decomposed codepoints, the unicodedata module can be used:

import unicodedata as ud
line = ud.normalize('NFC', line)  # composed codepoints
line = ud.normalize('NFD', line)  # decomposed codepoints

There are also NFKD and NFKC. See the Unicode standard for details on which is best for your use case.
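As a minimal sketch (Python 2, matching the code above) of why normalization matters: the composed and decomposed forms of ẽ are different strings until both are normalized the same way, so a replace can silently miss one of them:

#coding:utf8
import unicodedata as ud

composed = u'\u1ebd'      # single codepoint: LATIN SMALL LETTER E WITH TILDE
decomposed = u'e\u0303'   # U+0065 plus U+0303 COMBINING TILDE

print composed == decomposed                       # False
print ud.normalize('NFC', decomposed) == composed  # True
print ud.normalize('NFD', composed) == decomposed  # True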

If you are reading from a file, use io.open and specify the encoding of the file to automatically convert the input to Unicode:

import io
with io.open('data.txt', 'r', encoding='utf8') as f:
    for line in f:
        # do something with the Unicode line
        print line
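Putting the pieces together, here is a minimal sketch of the IPA-to-Arpabet substitution itself. The file name ipa.txt and the mapping table are only illustrative; you would extend the dictionary with the full set of symbols you need. Normalizing both the line and the dictionary keys to NFC keeps composed and decomposed input consistent:

#coding:utf8
import io
import unicodedata as ud

# Illustrative mapping only -- fill in the full IPA-to-Arpabet table you need.
IPA_TO_ARPABET = {
    u'ʃ': u' sh ',
    u'ɐ͂': u' an ',
    u'ẽ': u' en ',
}

with io.open('ipa.txt', 'r', encoding='utf8') as f:
    for line in f:
        line = ud.normalize('NFC', line)  # make composed/decomposed forms consistent
        for ipa, arpabet in IPA_TO_ARPABET.items():
            line = line.replace(ud.normalize('NFC', ipa), arpabet)
        print line.rstrip()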

Upvotes: 2
