How to find Unicode Pattern using Regex in Python3.7?

Question

I am trying to find a Unicode pattern, but re.findall always returns an empty list []. I have tried the same pattern in KWrite and it worked fine.

I have tried \u \u in place of \w, but that didn't work for me. The non-ASCII part of the input can be any Unicode string.

import re

InputString = r"[[ਅਤੇ\CC_CCD]]_CCP"

Result = re.findall(r'[\[]+[\w]+\\\w+[\]]+[_]\w+', InputString, flags=re.U)

print(Result)

Gurmanjot Singh · Accepted Answer

There is an extra character ੇ between ਤ and \ which cannot be matched by \w+. It is a combining vowel sign (code point U+0A47), and Python's re module does not treat combining marks as word characters. So I have added [\u0A47] to the regex.
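A quick way to check this claim yourself is to look at each character of the word. This is only a small sketch: the variable names are mine, and it uses the standard unicodedata module to print the Unicode category next to whether \w matches the character.

```python
import re
import unicodedata

word = "ਅਤੇ"

# For each character: code point, Unicode general category,
# and whether a single \w matches it.
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), re.fullmatch(r"\w", ch) is not None)
    for ch in word
]

for code_point, category, is_word_char in report:
    print(code_point, category, is_word_char)
```

The two letters come out as category Lo (a letter), while the vowel sign is Mn (a nonspacing mark), which is why \w+ stops right before it.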

Try this Regex:

\[+\w+[\u0A47]\\\w+\]\]\w+


Explanation:

\[+ - matches 1+ occurrences of [
\w+ - matches 1+ occurrences of a word character
[\u0A47] - matches the vowel sign ੇ (U+0A47); use [^\\]* here instead to match 0+ occurrences of any character which is not \
\\ - matches \
\w+ - matches 1+ occurrences of a word character
\]\] - matches ]]
\w+ - matches 1+ occurrences of a word character

Python code
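Here is a minimal sketch of how the pattern above can be used from Python, run against the input string from the question. The variable names are mine, and the pattern from the question is included only to show the difference.

```python
import re

input_string = r"[[ਅਤੇ\CC_CCD]]_CCP"

# Pattern from the question: \w+ stops at the vowel sign, so nothing matches.
original = re.findall(r"[\[]+[\w]+\\\w+[\]]+[_]\w+", input_string)

# Pattern from this answer: [\u0A47] consumes the vowel sign before the backslash.
fixed = re.findall(r"\[+\w+[\u0A47]\\\w+\]\]\w+", input_string)

print(original)
print(fixed)
```

The first call returns an empty list, while the second returns a list containing the whole input string. The re.U flag is left out because Unicode matching is already the default for str patterns in Python 3.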

The words are written in the Gurmukhi script, whose Unicode block is U+0A00 to U+0A7F. So you can also use the regex:

\[+[\u0A00-\u0A7F]+\\\w+\]\]\w+
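A short sketch of the range-based version in Python follows. The second sample string is made up by me for illustration and is not from the question; it contains several vowel signs and a nasalisation mark to show that the character range covers letters and combining marks alike.

```python
import re

# The character class covers the whole Gurmukhi block, letters and marks alike.
pattern = re.compile(r"\[+[\u0A00-\u0A7F]+\\\w+\]\]\w+")

input_string = r"[[ਅਤੇ\CC_CCD]]_CCP"    # string from the question
other_sample = r"[[ਪੰਜਾਬੀ\NN_NNP]]_NNP"  # hypothetical sample, not from the question

for text in (input_string, other_sample):
    print(pattern.findall(text))
```

Both strings are matched in full. Because the class is restricted to a single block, this version will not match words in other scripts; for arbitrary Unicode text a broader class is needed.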
