zazzylele

Reputation: 85

regex find long and short vowels

I have a list of words in various languages that use more than the usual five vowels (a, e, i, o, u). I am using regex to split these words into syllables after a vowel-consonant pair. For words like "ahotnikil" I use re.match("^[a|e|i|y|o|u|}|@]", word) and get the correct segmentation, ['a', 'hot', 'ni', 'kil'].
But my re.match expression does not find all the vowels in words such as:

ɔčolwun ['ɔčol', 'wun']

duktəwurəji ['du', 'ktəwu', 'rəji']

śīnici ['śīnici']

The characters "ɔ", "ə" and "ī" are not recognized. I tried copying them into the expression, but they are still not recognized. I used the following code to check the encoding of the file (output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}):

import chardet

# Read the first 100000 bytes of the file and let chardet guess the encoding
with open(filename, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
print(result)

So then I don't know what to do.

Upvotes: 1

Views: 60

Answers (1)

Wiktor Stribiżew

Reputation: 627034

You can use

re.match("[aeiyou}@]|ɔ|ə|ī", word)

Note: re.match only looks for a match at the start of the string. If you need to detect these chars anywhere inside a word, use re.search.
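To make the difference concrete, here is a minimal sketch using one of the words from the question. The name VOWEL is only an illustrative choice, not something from the original code.

```python
import re

# Same pattern as above, compiled once for reuse.
VOWEL = re.compile(r"[aeiyou}@]|ɔ|ə|ī")

word = "duktəwurəji"

# re.match only tests position 0, and "d" is not a vowel, so this is None.
print(VOWEL.match(word))

# re.search scans forward and stops at the first vowel it meets.
print(VOWEL.search(word).group())

# re.findall collects every vowel in the word, in order.
print(VOWEL.findall(word))
```

For syllable splitting you will most likely want search, findall or finditer rather than match, since the vowels of interest are rarely at position 0.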

Details:

  • [aeiyou}@] - a char from the aeiyou}@ set (inside brackets, | is a literal pipe rather than "or", so it is not needed as a separator)
  • | - or
  • ɔ - a ɔ char
  • | - or
  • ə - a ə char
  • | - or
  • ī - a ī char
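If pasting the characters into the pattern still does not help, one possible cause is that the file stores "ī" as "i" followed by a combining macron (U+0304) instead of the single precomposed char. That is an assumption on my part; nothing in the question confirms it. A rough sketch of how to rule it out, using a hypothetical helper find_vowels and normalizing to NFC before matching:

```python
import re
import unicodedata

# One character class, equivalent to "[aeiyou}@]|ɔ|ə|ī".
# \u0254 is ɔ, \u0259 is ə, \u012b is the precomposed ī.
VOWELS = re.compile("[aeiyou}@\u0254\u0259\u012b]")

def find_vowels(word):
    """Hypothetical helper: return all vowels after NFC normalization."""
    # NFC composes "i" + U+0304 (combining macron) into the single char ī.
    return VOWELS.findall(unicodedata.normalize("NFC", word))

# "śīnici" written with combining marks (assumed input form, for illustration).
decomposed = "s\u0301i\u0304nici"

# Without normalization the macron is a separate char, so only a plain "i" matches.
print(VOWELS.findall(decomposed))

# With normalization the long vowel is found as one unit.
print(find_vowels(decomposed))
```

If both calls print the same list for your real data, the file is already composed and the problem lies elsewhere.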

Upvotes: 1
