regex find long and short vowels

Question

I have a list of words in various languages that use more than the usual five vowels (a, e, i, o, u). I am using regex to divide these words into syllables after a vowel-consonant pair. For words like "ahotnikil" I use re.match("^[a|e|i|y|o|u|}|@]", word) to get the correct segmentation, ['a', 'hot', 'ni', 'kil'].
But my re.match expression does not find all the vowels in words such as:

ɔčolwun ['ɔčol', 'wun']

duktəwurəji ['du', 'ktəwu', 'rəji']

śīnici ['śīnici']

The characters "ɔ", "ə" and "ī" are not recognized. I tried copying them into the expression, but they are still not recognized. I used the following code to check the encoding of the file (output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}):

import chardet

# Guess the file encoding from its first 100,000 bytes
with open(filename, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
print(result)

So the file encoding looks fine, and I don't know what else to try.


Answers (1)
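
Two things in the pattern are worth checking before suspecting the file encoding. Inside square brackets, "|" is a literal character rather than "or", and re.match only tests the start of the string, so vowels later in the word are never examined. A further possible cause (an assumption, since the data file is not shown) is that "ī" is stored as "i" plus a combining macron, or that "ə" and "ɔ" in the file are look-alike code points, so the pasted characters never compare equal. Below is a minimal sketch of how this could be checked and worked around. The names VOWELS, find_vowels and codepoints are hypothetical, and the vowel inventory is only a guess based on the examples in the question.

```python
import re
import unicodedata

# Assumed vowel inventory (extend as needed). '}' and '@' are kept from the
# original pattern; no '|' is needed inside a character class.
VOWELS = unicodedata.normalize("NFC", "aeiouy}@ɔəī")
VOWEL_RE = re.compile("[" + VOWELS + "]")


def find_vowels(word):
    """Return every vowel in the word, in order of appearance."""
    # NFC composes 'i' + combining macron into the single character 'ī',
    # so the word and the pattern use the same representation.
    word = unicodedata.normalize("NFC", word)
    # findall scans the whole word; re.match would only test position 0.
    return VOWEL_RE.findall(word)


def codepoints(word):
    """List the code points of a word, to spot look-alike or combining characters."""
    return ["U+%04X" % ord(ch) for ch in word]


if __name__ == "__main__":
    for w in ["ahotnikil", "ɔčolwun", "duktəwurəji", "śīnici"]:
        print(w, find_vowels(w), codepoints(w))
```

If codepoints shows something like U+0069 followed by U+0304 for "ī" in a word read from your file, the text is in decomposed form, and normalizing both the words and the pattern as above should make them match. The syllable splitting itself is left out here, since the exact rule is not spelled out in the question.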
