Reputation: 1369
I am trying to parse strings with Python 3 and the re module, using the pattern "(c,c,c)", where c is a single character chosen from (a, b, ë, ɪ̈). I wrote something like this:
import re

src = "(a,b,ɪ̈)"
pattern = "[abëɪ̈]"
for r in re.finditer(r'\({0},{0},{0}\)'.format(pattern), src):
    print(r.group())
But the regex doesn't work with ɪ̈: Python treats ɪ̈ as two characters (ɪ followed by a combining diaeresis), so the regex cannot match "(a,b,ɪ̈)". I don't have this problem with ë: Python treats ë as a single character, and my regex matches "(a,b,ë)" as expected. I tried normalizing both src and pattern with unicodedata.normalize('NFD', ...), without success.
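A quick way to see the behaviour described above is to inspect the code points directly. This is only an illustrative sketch: the variable names are made up here, and the strings are written with escapes so that the two-code-point form of ɪ̈ (U+026A followed by U+0308) and the precomposed ë (U+00EB) are unambiguous.

```python
import re
import unicodedata

i_diaeresis = "\u026a\u0308"  # ɪ followed by COMBINING DIAERESIS
e_diaeresis = "\u00eb"        # precomposed ë

# Code points as Python sees them
print([hex(ord(ch)) for ch in i_diaeresis])  # two code points
print([hex(ord(ch)) for ch in e_diaeresis])  # one code point

# NFC cannot merge the pair: Unicode has no precomposed ɪ̈
nfc_len = len(unicodedata.normalize("NFC", i_diaeresis))
# NFD goes the other way and splits ë into e + U+0308
nfd_len = len(unicodedata.normalize("NFD", e_diaeresis))
print(nfc_len, nfd_len)

# Inside [...], U+026A and U+0308 are two separate set members,
# so the class matches one code point at a time, never the pair
char_class = "[ab" + e_diaeresis + i_diaeresis + "]"
whole = re.fullmatch(char_class, i_diaeresis)
print(whole)
```

So neither normalization form turns ɪ̈ into a single code point, and NFD additionally makes ë a two-code-point sequence, which is why normalizing src and pattern does not help a character class.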
How can I solve this problem? Any help would be appreciated!
PS: I fixed some typos thanks to pythonm.
Upvotes: 1
Views: 215
Reputation: 414179
You could use | (alternation) to work around it:
#!/usr/bin/env python3
import re
print(re.findall(r'\({0},{0},{0}\)'.format("(?:[abë]|ɪ̈)"), "(a,b,ɪ̈)"))
# -> ['(a,b,ɪ̈)']
The above treats ɪ̈ as two characters:
re.compile(r'[abë]|ɪ̈', re.DEBUG)
output:
branch
  in
    literal 97
    literal 98
    literal 235
or
  literal 618
  literal 776
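If the set of symbols grows, the same alternation idea can be generated instead of written by hand. The following is a minimal sketch, not part of the original answer: grapheme_alternation is a hypothetical helper name, and it assumes every allowed symbol is known in advance as a string of one or more code points.

```python
import re

def grapheme_alternation(symbols):
    """Build a non-capturing alternation matching any one of `symbols`."""
    # Longest first, so a multi-code-point symbol is tried before
    # any shorter symbol that might be a prefix of it
    ordered = sorted(symbols, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# a, b, precomposed ë, and ɪ + combining diaeresis
symbols = ["a", "b", "\u00eb", "\u026a\u0308"]
c = grapheme_alternation(symbols)
triple = re.compile(r"\({0},{0},{0}\)".format(c))

text = "(a,b,\u026a\u0308) (a,\u00eb,b) (a,b,\u026a)"
matches = triple.findall(text)
print(matches)
```

Here the first two groups match, while the last one does not, because a bare ɪ without the diaeresis is not in the symbol list. The input is assumed to use the same code point sequences as the symbol list; if it might not, normalizing both to the same form first would be needed.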
Upvotes: 3