UMR

Reputation: 39

How to find a Unicode pattern using regex in Python 3.7?

I am trying to match a Unicode pattern, but re.findall always returns an empty list []. The same pattern works fine in KWrite.

I have tried \u and \\u in place of \w, but neither worked for me. The Unicode part of the input can be any Unicode string.

import re

InputString=r"[[ਅਤੇ\CC_CCD]]_CCP"

Result = re.findall(r'[\[]+[\w]+\\\w+[\]]+[_]\w+',InputString,flags=re.U)

print(Result)

Upvotes: 1

Views: 200

Answers (1)

Gurmanjot Singh

Reputation: 10360

There is an extra character just before the \ (the combining vowel sign ੇ) which cannot be matched by \w+. Its hex value is 0xA47, so I have added [\u0A47] to the regex.
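To check which character \w rejects, one can inspect each character of the word. This is only a diagnostic sketch using the standard unicodedata module; the names word and report are illustrative and not part of the original code.

```python
import re
import unicodedata

# The Gurmukhi word from the question, written with escapes so the
# individual code points are visible: U+0A05, U+0A24, U+0A47.
word = "\u0A05\u0A24\u0A47"

# For each character: code point, Unicode general category,
# and whether Python's \w matches it on its own.
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), bool(re.fullmatch(r"\w", ch)))
    for ch in word
]

for row in report:
    print(row)
```

The last character is a nonspacing mark (category Mn), which Python's \w does not treat as a word character, while the two letters before it are matched.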

Try this Regex:

\[+\w+[\u0A47]\\\w+]]\w+

Explanation:

  • \[+ - matches 1+ occurrences of [
  • \w+ - matches 1+ occurrences of a word character
  • [\u0A47] - matches the character U+0A47 (Gurmukhi vowel sign ੇ)
  • \\ - matches \
  • \w+ - matches 1+ occurrences of a word character
  • ]] - matches ]]
  • \w+ - matches 1+ occurrences of a word character

Python code
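A minimal sketch of how the regex above might be applied with re.findall, reusing the input string from the question. The variable names are illustrative.

```python
import re

# Input string from the question (raw string, so the backslash is literal).
input_string = r"[[ਅਤੇ\CC_CCD]]_CCP"

# \w+ consumes the Gurmukhi letters, [\u0A47] consumes the vowel sign
# that \w does not match, and \\ then matches the literal backslash.
pattern = r"\[+\w+[\u0A47]\\\w+]]\w+"

result = re.findall(pattern, input_string)
print(result)
```

With this input, the pattern should match the whole string, so the list contains one element instead of being empty.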

The words are in the Gurmukhi script, whose Unicode block is U+0A00 to U+0A7F. So you can also use the regex:

\[+[\u0A00-\u0A7F]+\\\w+]]\w+
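As a sketch, the range-based regex can be compiled once and reused. It assumes the text between the brackets and the backslash is entirely Gurmukhi; the names gurmukhi_pattern and matches are illustrative.

```python
import re

# Input string from the question (raw string, so the backslash is literal).
input_string = r"[[ਅਤੇ\CC_CCD]]_CCP"

# [\u0A00-\u0A7F]+ covers the whole Gurmukhi block, letters and
# combining vowel signs alike, so no single code point is special-cased.
gurmukhi_pattern = re.compile(r"\[+[\u0A00-\u0A7F]+\\\w+]]\w+")

matches = gurmukhi_pattern.findall(input_string)
print(matches)
```

This variant only handles Gurmukhi text; input in another script would need a different range.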

Upvotes: 1
