Using unicode char code in regular expression

Question

To simplify my task, let's say I want to find any words written in Hebrew on a web page. I know that the Hebrew letters occupy code points U+05D0 to U+05EA, so I want to write something like:

import re
import urllib2

expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"

website_handle = urllib2.urlopen(url)
website_text = website_handle.read()
matches = re.findall(expr, website_text)
for item in matches:
    print item

The output I would expect is:

עברית

But instead the output is a lot of Chinese/Japanese characters.

Sanich · Accepted Answer

The expression should be:

expr = u"[\u05D0-\u05EA]+"

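Notice the 'u' at the beginning. In Python 2, only a unicode literal interprets \u05D0 and \u05EA as single characters; in a plain string literal they remain backslash sequences, so the character class does not describe the Hebrew range.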
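
For illustration, here is a minimal sketch of the unicode pattern applied to decoded text. It assumes the page is UTF-8 encoded, and it uses an in-memory byte string in place of the downloaded page so it runs without network access. The names find_hebrew_words and HEBREW_WORD are hypothetical, not part of the question. Since read() returns bytes, the text is probably best decoded to unicode before matching; otherwise the pattern would be compared against individual bytes rather than characters.

```python
import re

# Unicode pattern covering the Hebrew letters U+05D0 (alef) to U+05EA (tav).
# The u prefix is required in Python 2 and accepted in Python 3.3+.
HEBREW_WORD = re.compile(u"[\u05D0-\u05EA]+")


def find_hebrew_words(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes and return every run of Hebrew letters.

    The default encoding is an assumption; a real page may declare another.
    """
    text = raw_bytes.decode(encoding)
    return HEBREW_WORD.findall(text)


# Stand-in for website_handle.read(): UTF-8 bytes containing one Hebrew word.
sample = u"Hebrew: \u05e2\u05d1\u05e8\u05d9\u05ea, English".encode("utf-8")
words = find_hebrew_words(sample)
```

With the question's code, the same idea would amount to passing website_text.decode("utf-8") to findall instead of the raw website_text.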
