Reputation: 39
I am trying to match a pattern that contains Unicode text, but re.findall always returns an empty list []. I tried the same pattern in KWrite and it worked fine.
I have tried \u and \\u in place of \w, but neither worked for me. The Unicode part can be any Unicode string.
import re

InputString = r"[[ਅਤੇ\CC_CCD]]_CCP"
Result = re.findall(r'[\[]+[\w]+\\\w+[\]]+[_]\w+', InputString, flags=re.U)
print(Result)
Upvotes: 1
Views: 200
Reputation: 10360
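There is an extra character, ੇ (the Gurmukhi vowel sign EE), between ਤ and \ which cannot be matched by \w+. It is a combining mark, and Python's \w does not treat combining marks as word characters. Its code point is U+0A47, so I have added [\u0A47] to the regex.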
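The diagnosis above can be checked with a short sketch. It assumes Python 3 and uses the standard unicodedata module; the names word and report are illustrative and not from the question.

```python
import re
import unicodedata

# The Gurmukhi word from the question, written as escapes:
# U+0A05 (letter A), U+0A24 (letter TA), U+0A47 (vowel sign EE).
word = "\u0A05\u0A24\u0A47"

# For each character, record its code point, its Unicode general category,
# and whether Python's \w accepts it on its own.
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), re.fullmatch(r"\w", ch) is not None)
    for ch in word
]

for row in report:
    print(row)
```

The two letters are in category Lo and match \w, while the vowel sign is a nonspacing mark (Mn) and does not, which is why [\w]+ stops just before the backslash.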
Try this Regex:
\[+\w+[\u0A47]\\\w+]]\w+
Explanation:
\[+
- matches 1+ occurrences of [
\w+
- matches 1+ occurrences of a word character
[\u0A47]
- matches the vowel sign ੇ (U+0A47)
\\
- matches \
\w+
- matches 1+ occurrences of a word character
]]
- matches ]]
\w+
- matches 1+ occurrences of a word character
The words are written in the Gurmukhi script, whose Unicode range is 0A00 - 0A7F. So you can also use the regex:
\[+[\u0A00-\u0A7F]+\\\w+]]\w+
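As a minimal sketch, assuming Python 3 (where re understands \uXXXX escapes inside a pattern), both regexes can be run against the string from the question. The variable names below are illustrative.

```python
import re

# Input from the question; the raw string keeps the backslash literal.
input_string = r"[[ਅਤੇ\CC_CCD]]_CCP"

# Pattern 1: \w+ for the letters, then the vowel sign U+0A47 explicitly.
explicit_sign = r'\[+\w+[\u0A47]\\\w+]]\w+'

# Pattern 2: any run of characters from the Gurmukhi block U+0A00-U+0A7F.
gurmukhi_block = r'\[+[\u0A00-\u0A7F]+\\\w+]]\w+'

results = [re.findall(p, input_string) for p in (explicit_sign, gurmukhi_block)]
for found in results:
    print(found)
```

Both patterns match the whole input here. The first only works when the word ends in that one vowel sign, so the block-range version is the more general of the two for Gurmukhi text.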
Upvotes: 1