I'm working with a set of regex patterns that I have to match in a target text.
My problematic regex is something like this: (İg)[[:punct:][:space:]]+[[:alnum:]]+
Initially, I noticed that Python's re package doesn't support POSIX character classes like [:punct:]. Then I discovered that the regex library (used instead of re) does support these forms.
The problem now is that, with both re and regex, enabling IGNORECASE also seems to ignore diacritics, which I want to stay significant. For example:
# import re
import regex as re  # third-party "regex" package, used here in place of re

active_patterns = ["(İg)[[:punct:][:space:]]+[[:alnum:]]+"]
text = "A big problem"
for pattern in active_patterns:
    # Compile case-insensitively, then report every match in the text
    compiled_pattern = re.compile(pattern, re.IGNORECASE)
    for match in compiled_pattern.finditer(text):
        print(match)
In this code, I want to ignore case but not diacritics. However, the regex library seems to ignore diacritics when IGNORECASE is enabled: this snippet reports a match on "ig problem", even though the text contains no İ. The same behaviour occurs with the re library if I remove the unsupported parts and reduce the regex to (İg). In that case the match is just ig.
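To narrow this down, here is a small check using only the standard re module. It suggests the match comes from how İ (U+0130) is lowercased rather than from diacritics being stripped in general: str.lower() maps it to two code points, while IGNORECASE matching appears to compare against a plain i. That last part is my reading of the behaviour, not something I have confirmed in the documentation.

```python
import re
import unicodedata

dotted = "İ"  # U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE

# str.lower() applies the full case mapping: 'i' followed by a combining dot
lowered = dotted.lower()
names = [unicodedata.name(ch) for ch in lowered]

# With IGNORECASE, a plain ASCII 'i' is accepted for the dotted capital
plain_i_matches = re.fullmatch(dotted, "i", re.IGNORECASE) is not None

# Without the flag, the plain 'i' is rejected as expected
exact_only = re.fullmatch(dotted, "i") is None

print(names)
print(plain_i_matches, exact_only)
```

So the string-level lowercase of İ is not the single character i, yet the case-insensitive regex treats the two as equal.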
Is there a way in Python to make the regex ignore case but keep diacritics intact?
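One workaround I am considering, shown only as a rough sketch: keep IGNORECASE, then discard any match whose captured literal differs from the pattern's literal by more than case, comparing with str.casefold(). The names LITERAL and strict_matches are made up for illustration. The sketch assumes the literal sits in group 1 and uses the reduced pattern (İg) so that it runs with the standard re module.

```python
import re

LITERAL = "İg"  # hypothetical: the literal part of the pattern, kept separately
pattern = re.compile("(İg)", re.IGNORECASE)


def strict_matches(text):
    """Yield only matches whose group 1 equals LITERAL under full case folding."""
    wanted = LITERAL.casefold()  # 'i' + U+0307 + 'g' for this literal
    for m in pattern.finditer(text):
        if m.group(1).casefold() == wanted:
            yield m


# Plain IGNORECASE finds the unwanted 'ig' inside "big"
loose = [m.group(0) for m in pattern.finditer("A big problem")]

# The post-filter rejects it, because 'ig' lacks the dot above
strict = [m.group(0) for m in strict_matches("A big problem")]

# Genuine case variants of the literal are still kept
kept = [m.group(0) for m in strict_matches("İG and İg")]

print(loose, strict, kept)
```

This filters after the fact instead of changing how the regex engine compares characters, so it only helps for patterns whose case-sensitive part is a known literal. I would still prefer a flag or option that does this directly.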