Reputation: 85
I have a list of words in various languages that make use of more than the usual five vowels (a, e, i, o, u). I am using regex to divide these words into syllables after a vowel-consonant pair. For words like "ahotnikil" I use re.match("^[a|e|i|y|o|u|}|@]", word) to get the correct segmentation, ['a', 'hot', 'ni', 'kil'].
But my re.match expression does not find all the vowels, for words such as:
ɔčolwun ['ɔčol', 'wun']
duktəwurəji ['du', 'ktəwu', 'rəji']
śīnici ['śīnici']
The characters "ɔ, ə, ī" are not recognized. I tried copying them into the expression, but they are still not recognized. I used the following code to check the encoding of the file (output: {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}):
import chardet
with open(filename, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result
So then I don't know what to do.
Upvotes: 1
Views: 60
Reputation: 627034
You can use
re.match("[aeiyou}@]|ɔ|ə|ī", word)
Note: re.match will look for matches at the start of the string by default. If you need to detect these chars anywhere inside a word, use re.search.
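To make the difference between re.match and re.search concrete, here is a minimal sketch using the pattern above. The helper names starts_with_vowel and first_vowel_index are hypothetical and only serve the illustration; they are not part of the original code.

```python
import re

# Vowel pattern from the answer, compiled once so it can be reused.
VOWEL = re.compile("[aeiyou}@]|ɔ|ə|ī")

def starts_with_vowel(word):
    # re.match only tries the pattern at the very start of the string.
    return VOWEL.match(word) is not None

def first_vowel_index(word):
    # re.search scans the whole string and returns the first match anywhere.
    m = VOWEL.search(word)
    return m.start() if m else -1

print(starts_with_vowel("ɔčolwun"))
print(starts_with_vowel("duktəwurəji"))
print(first_vowel_index("duktəwurəji"))
```

If a character such as "ī" still fails to match, one possibility worth checking (an assumption, not something confirmed by the question) is that the file stores it as "i" plus a combining macron rather than as a single code point.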
Details:
- [aeiyou}@] - a char from the aeiyou}@ set
- | - or
- ɔ - a ɔ char
- | - or
- ə - a ə char
- | - or
- ī - a ī char
Upvotes: 1