Sanich
Sanich

Reputation: 1835

Using unicode char code in regular expression

Simplifying my task, lets say I want to find any words written in Hebrew in some web page. So I know that Hebrew char codes are U+05D0 to U+05EA. I want to write something like:

expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"    

web_handle = urllib2.urlopen(url)
website_text = website_handle.read()    
matches = sre.findall(exp, website_text)
for item in matches:
    print item

The output I would expect is:

עברית

But instead the out put is a lot of Chinese/Japanese chars.

Upvotes: 1

Views: 248

Answers (2)

Sanich
Sanich

Reputation: 1835

The expression should be:

expr = u"[\u05D0-\u05EA]+"

Notice the 'u' at the beginning.

Upvotes: 0

Kasravnd
Kasravnd

Reputation: 107287

You can just use standard representation of unicode in python within a character class :

re.findall([\u05D0-\u05EA], website_text,re.U)

Upvotes: 1

Related Questions