Paolo Magnani

Reputation: 699

How to ignore case but not diacritics with Python regex?

I'm working with a set of regex patterns that I have to match in a target text.

My problematic regex is something like this: (İg)[[:punct:][:space:]]+[[:alnum:]]+

Initially, I noticed that Python's re module doesn't support POSIX character classes like [:punct:]. Then I discovered that the regex library (used instead of re) does support them.

The problem now is that, with both re and regex, enabling IGNORECASE also seems to ignore diacritics, which I need to keep significant. For example:

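# import re  (the standard module rejects POSIX classes such as [:punct:])
import regex as re  # third-party regex package, aliased so the calls below stay the same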

active_patterns = ["(İg)[[:punct:][:space:]]+[[:alnum:]]+"]
text = "A big problem"

for pattern in active_patterns:
    compiled_pattern = re.compile(pattern, re.IGNORECASE)
    for match in compiled_pattern.finditer(text):
        print(match)

In this code, I want to ignore case but not diacritics. However, the regex library seems to ignore diacritics when IGNORECASE is enabled: this snippet reports a match on "ig problem". The same happens with the re library once I remove the unsupported parts, i.e. with the pattern (İg); in that case the match is just "ig".
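
For reference, a minimal sketch that reproduces the behaviour with only the standard re module, so no third-party install is needed. The helper name first_match and the é/e comparison are not from the original post; they are added here purely for illustration.

```python
import re


def first_match(pattern, text):
    """Return the first substring matched under re.IGNORECASE, or None."""
    m = re.search(pattern, text, re.IGNORECASE)
    return m.group(0) if m else None


# "\u0130" is İ (LATIN CAPITAL LETTER I WITH DOT ABOVE), as in the pattern above.
dotted_i = first_match("(\u0130g)", "A big problem")

# Assumed comparison case, not from the post: é ("\u00e9") against a plain "e".
e_acute = first_match("\u00e9", "e")

print(repr(dotted_i), repr(e_acute))
```

If the second call finds nothing while the first one does, that would suggest the effect comes from how İ is case-mapped (Unicode gives it a plain i as its lowercase counterpart) rather than from diacritics being dropped in general.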

Is there a way in Python to make the regex ignore case but keep diacritics intact?

Upvotes: 7

Views: 131
