suizokukan

Reputation: 1369

How do regexes in Python 3 deal with diacritics?

I am trying to parse, with Python 3 and the re module, strings that follow the pattern "(c,c,c)", where c is one character chosen from (a, b, ë, ɪ̈). I wrote something like this:

import re

src = "(a,b,ɪ̈)"
pattern = "[abëɪ̈]"  # character class: each code point is a separate member
for r in re.finditer(r'\({0},{0},{0}\)'.format(pattern), src):
    print(r.group())

But the regex doesn't work with ɪ̈: Python treats ɪ̈ as two characters (ɪ plus a combining diaeresis), so the regex cannot match "(a,b,ɪ̈)". I don't have this problem with ë: Python treats ë as a single character, and my regex matches "(a,b,ë)" as expected. I tried normalizing both src and pattern with unicodedata.normalize('NFD', ...), without success.

How can I solve this problem? Any help would be appreciated.

PS : I fixed some typos thanks to pythonm.

Upvotes: 1

Views: 215

Answers (1)

jfs

Reputation: 414179

You could use | to work around it:

#!/usr/bin/env python3
import re

print(re.findall(r'\({0},{0},{0}\)'.format("(?:[abë]|ɪ̈)"), "(a,b,ɪ̈)"))
# -> ['(a,b,ɪ̈)']

The above treats ɪ̈ as two characters:

re.compile(r'[abë]|ɪ̈', re.DEBUG)

output:

branch 
  in 
    literal 97
    literal 98
    literal 235
or
  literal 618 
  literal 776 
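The same idea can be generalized. Below is a minimal sketch, assuming you want to build the alternation from a list of allowed symbols; the helper name grapheme_alternation is hypothetical and not part of the re module. It also checks why normalization alone does not help here: there is no precomposed code point for ɪ with a diaeresis, so NFC leaves it as two code points, while NFD splits ë into two as well.

```python
import re
import unicodedata


def grapheme_alternation(items):
    """Build a non-capturing alternation from the given strings.

    Hypothetical helper: longer items are placed first so that a
    multi-code-point sequence is tried before any shorter alternative.
    """
    ordered = sorted(items, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"


I_DIAERESIS = "\u026a\u0308"  # ɪ followed by COMBINING DIAERESIS (two code points)
E_DIAERESIS = "\u00eb"        # ë as a single precomposed code point

# NFC cannot compose ɪ + U+0308, and NFD decomposes ë into e + U+0308.
nfc_len = len(unicodedata.normalize("NFC", I_DIAERESIS))
nfd_len = len(unicodedata.normalize("NFD", E_DIAERESIS))

c = grapheme_alternation(["a", "b", E_DIAERESIS, I_DIAERESIS])
pattern = re.compile(r"\({0},{0},{0}\)".format(c))

matches = pattern.findall("(a,b,\u026a\u0308)")
bare = pattern.findall("(a,b,\u026a)")  # ɪ without the diaeresis
print(nfc_len, nfd_len, matches, bare)
```

Unlike a character class such as [abëɪ̈], which accepts a bare ɪ or a bare combining diaeresis as one of the three items, the alternation only accepts the full two-code-point sequence.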

Upvotes: 3
