Mahdi

Reputation: 1035

How can I use a regular expression on a Unicode string in Python?

Hi, I want to use a regular expression on the following UTF-8 string:

</td><td>عـــــــــــادي</td><td> 40.00</td>

I want to extract "عـــــــــــادي". How can I do this?

My code for this is:

state = re.findall(r'td>...</td',s)

Thanks

Upvotes: 5

Views: 12327

Answers (2)

Stefan van den Akker

Reputation: 6999

I ran across something similar when trying to match a string in Russian. For your situation, Michele's answer works fine. If you want to use special sequences like \w and \s, though, you have to change some things. I'm just sharing this, hoping it will be useful to someone else.

>>> string = u"</td><td>Я люблю мороженое</td><td> 40.00</td>"

Make your string Unicode by placing a u before the opening quotation mark.

>>> pattern = re.compile(ur'>([\w\s]+)<', re.UNICODE)

Pass the re.UNICODE flag so that \w and \s follow the Unicode character database instead of matching ASCII only (see docs).

(Alternatively, you can spell out a character range for your language. For Russian this would be [а-яА-Я], so:

pattern = re.compile(ur'>([а-яА-Я\s]+)<')

In that case, you don't have to set a flag anymore, since you're not using a special sequence.)

>>> match = pattern.findall(string)
>>> for i in match:
...     print i
... 
Я люблю мороженое
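For readers on Python 3, here is a minimal sketch of the same idea. It assumes Python 3 semantics: every str is already Unicode, the ur'' prefix no longer exists, and \w and \s are Unicode-aware by default for str patterns, so re.UNICODE is redundant. The variable names are illustrative. The second half shows a caveat of the explicit range: ё and Ё fall outside а-я and А-Я, so they have to be added by hand.

```python
import re

# In Python 3 every str is Unicode, so no u prefix is needed.
text = "</td><td>Я люблю мороженое</td><td> 40.00</td>"

# \w and \s are Unicode-aware by default for str patterns in Python 3.
pattern = re.compile(r">([\w\s]+)<")
matches = pattern.findall(text)
for m in matches:
    print(m)

# Caveat for the explicit range: ё (U+0451) and Ё (U+0401) lie outside
# а-я and А-Я, so a word starting with ё is only partly matched.
yo_word = "\u0451лка"
narrow = re.findall("[а-яА-Я]+", yo_word)
wide = re.findall("[а-яА-Я\u0451\u0401]+", yo_word)
```

With the narrow range, only the part of the word after ё is matched. Adding the two extra code points makes the class cover the whole word.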

Upvotes: 6

Michele Spagnuolo

Reputation: 932

According to PEP 263 (Defining Python Source Code Encodings), you first need to tell Python that the whole source file is UTF-8 encoded by adding a comment like this on the first or second line:

# -*- coding: utf-8 -*-

Furthermore, try prefixing the pattern with 'ur' so that it is both raw and Unicode:

state = re.search(ur'td>([^<]+)</td',s)
res = state.group(1)

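I've also edited your regex to make it match. Each dot matches exactly one character, so three dots match exactly three, which is shorter than the text you want. On top of that, if s is a UTF-8 encoded byte string, each Arabic letter takes two bytes, so counting dots will not work as expected. The class [^<]+ instead matches everything up to the next "<".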
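To make the point about dots and multi-byte text concrete, here is a small sketch, assuming Python 3 (where str is Unicode and the ur'' prefix is not available). The names word, three_dots, n_chars and n_bytes are illustrative, and s is rebuilt from the sample in the question.

```python
import re

word = "عـــــــــــادي"
s = "</td><td>" + word + "</td><td> 40.00</td>"

# The question's pattern: three dots match exactly three characters,
# which is shorter than the word, so nothing is found.
three_dots = re.findall(r"td>...</td", s)

# [^<]+ matches any run of characters up to the next "<".
state = re.search(r"td>([^<]+)</td", s)
res = state.group(1)

# Characters vs. UTF-8 bytes: Arabic letters need more than one byte each.
n_chars = len(word)
n_bytes = len(word.encode("utf-8"))
print(n_chars, n_bytes)
```

Here three_dots comes back empty while res holds the whole word, and the byte count is larger than the character count, which is why counting characters with dots is fragile on encoded byte strings.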

Upvotes: 3
