Reputation: 1835
To simplify my task, let's say I want to find any words written in Hebrew on some web page.
I know that the Hebrew character codes run from U+05D0 to U+05EA.
I want to write something like:
import re
import urllib2

expr = "[\u05D0-\u05EA]+"
url = "https://en.wikipedia.org/wiki/Category:Countries"
website_handle = urllib2.urlopen(url)
website_text = website_handle.read()
matches = re.findall(expr, website_text)
for item in matches:
    print item
The output I would expect is:
עברית
But instead the output is a lot of Chinese/Japanese characters.
Upvotes: 1
Views: 248
Reputation: 1835
The expression should be:
expr = u"[\u05D0-\u05EA]+"
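Notice the 'u' at the beginning. In Python 2, \uXXXX escapes are only interpreted inside unicode literals, so without the prefix the pattern does not contain the Hebrew code points at all.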
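For completeness, here is a minimal sketch of the whole flow with the unicode pattern. It assumes the page is served as UTF-8 and decodes the raw bytes before matching, since the pattern is a unicode string. The names find_hebrew_words and sample are hypothetical, and the sample bytes are a stand-in for website_handle.read(), not content from the real page.

```python
import re

# Unicode pattern: with the u prefix, \u05D0 and \u05EA are real code points.
HEBREW_WORD = re.compile(u"[\u05D0-\u05EA]+")


def find_hebrew_words(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes, then return every run of Hebrew letters.

    The utf-8 default is an assumption; use the charset the page declares.
    """
    text = raw_bytes.decode(encoding)
    return HEBREW_WORD.findall(text)


# Hypothetical stand-in for website_handle.read(): UTF-8 bytes with two Hebrew words.
sample = (
    u'<a lang="he">\u05e2\u05d1\u05e8\u05d9\u05ea</a> '
    u"and <b>\u05e9\u05dc\u05d5\u05dd</b>"
).encode("utf-8")

words = find_hebrew_words(sample)
```

Looping over words and printing each item should then show the Hebrew text, provided the terminal encoding can display it.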
Upvotes: 0
Reputation: 107287
You can just use the standard representation of Unicode in Python within a character class:
re.findall(u'[\u05D0-\u05EA]+', website_text, re.U)
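As a small sketch of what the flag contributes: re.U changes how classes such as \w and \b behave, while an explicit code point range matches the same characters with or without it. The text below is a made-up sample, and it is already a unicode string rather than raw bytes.

```python
import re

# Made-up sample text (a unicode string, not raw bytes from a page).
text = u"abc \u05e2\u05d1\u05e8\u05d9\u05ea 123"
pattern = u"[\u05D0-\u05EA]+"

# The explicit range gives the same matches with and without re.U.
with_flag = re.findall(pattern, text, re.U)
without_flag = re.findall(pattern, text)
```

So the essential parts are the unicode pattern and unicode text; the flag is harmless here but not what makes the range work.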
Upvotes: 1