Reputation: 10133

Word boundary to use in unicode text for Python regex

I want to use a word boundary in a regex to match some Unicode text. However, Python's regex treats non-ASCII letters as non-word characters, so \b matches right next to them, as shown here:

>>> re.search(r"\by\b","üyü")
<_sre.SRE_Match object at 0x02819E58>

>>> re.search(r"\by\b","ğyğ")
<_sre.SRE_Match object at 0x028250C8>

>>> re.search(r"\by\b","uyu")
>>>

What should I do in order to make the word boundary symbol not match unicode letters?

Upvotes: 6

Answers (3)

Alexander Lubyagin

Reputation: 1494

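#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2: ur"..." is a raw unicode literal; (?u) turns on re.UNICODE.
import re

s = ur"abcd ААБВ"
rx1 = re.compile(ur"(?u)АБВ")      # no boundaries
rx2 = re.compile(ur"(?u)АБВ\b")    # boundary at the end only
rx3 = re.compile(ur"(?u)\bАБВ\b")  # boundaries on both sides
print rx1.findall(s)
print rx2.findall(s)
print rx3.findall(s)  # empty: "АБВ" is preceded by the letter "А"

print re.search(ur'(?u)ривет\b', ur'Привет')
print re.search(ur'(?u)\bривет\b', ur'Привет')  # None: "ривет" follows the letter "П"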

Output:

[u'\u0410\u0411\u0412']
[u'\u0410\u0411\u0412']
[]
<_sre.SRE_Match object at 0x01F056B0>
None
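For readers on Python 3, the same check can be sketched as follows. This is an assumption about the reader's environment, not part of the original answer: in Python 3, str patterns are Unicode-aware by default, so the ur prefix and (?u) are not needed. The variable names are illustrative only.

```python
import re

s = "abcd ААБВ"

# Python 3 str patterns use Unicode rules for \b and \w by default.
no_boundary = re.findall(r"АБВ", s)          # plain substring match
end_boundary = re.findall(r"АБВ\b", s)       # boundary required at the end
both_boundaries = re.findall(r"\bАБВ\b", s)  # boundary required on both sides

print(no_boundary)
print(end_boundary)
print(both_boundaries)  # empty, because "АБВ" is preceded by the letter "А"

# The second example from the answer: "ривет" is preceded by the letter "П".
tail_only = re.search(r"ривет\b", "Привет")
whole_word = re.search(r"\bривет\b", "Привет")
```

The last pattern finds nothing because the Cyrillic letter before the candidate match counts as a word character, which is the behavior the question asks for.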

Upvotes: 0

rolandvarga

Reputation: 126

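You can use the inline (?u) flag. In Python 2, pass the subject as a unicode string so the flag applies to code points rather than bytes:

re.search(r'(?u)\by\b', u'üyü')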

To gain familiarity with the flags, experiment with the following: (?iLmsux)

For further reading, check out Core Python Applications Programming, 3rd edition. It has a nice chapter on regular expressions.
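As a rough illustration of how inline flags relate to the flags argument, here is a minimal sketch assuming Python 3 and str patterns. The sample strings and variable names are made up for the example; only (?i), (?u), re.IGNORECASE and re.UNICODE come from the standard re module.

```python
import re

# Inline flags go at the very start of the pattern ...
inline = re.compile(r"(?iu)\by\b")
# ... and should behave like passing the flags as an argument.
by_arg = re.compile(r"\by\b", re.IGNORECASE | re.UNICODE)

standalone = "ü Y ü"  # "Y" is a separate word here
embedded = "üYü"      # "Y" sits between two letters

inline_standalone = inline.search(standalone) is not None
by_arg_standalone = by_arg.search(standalone) is not None
inline_embedded = inline.search(embedded) is not None
by_arg_embedded = by_arg.search(embedded) is not None

print(inline_standalone, by_arg_standalone)
print(inline_embedded, by_arg_embedded)
```

Both spellings match the standalone letter and reject the embedded one. Which form to use is mostly a matter of taste: inline flags travel with the pattern string, while the argument form keeps the pattern itself shorter.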

Upvotes: 5

user861537

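Use re.UNICODE, and in Python 2 pass the subject as a unicode string so the flag applies to code points rather than bytes:

>>> re.search(r"\by\b", u"üyü", re.UNICODE)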
>>>
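To see the difference the flag makes, the question's inputs can be run under both sets of rules. The sketch below assumes Python 3, where Unicode matching is the default for str patterns and re.ASCII restores the ASCII-only behavior seen in the question. The results dictionary and the extra "ü y ü" sample are illustrative additions.

```python
import re

samples = ["üyü", "ğyğ", "uyu", "ü y ü"]
pattern = r"\by\b"

results = {}
for text in samples:
    # Default in Python 3: \b uses Unicode word characters.
    unicode_hit = re.search(pattern, text) is not None
    # re.ASCII: only [a-zA-Z0-9_] count as word characters.
    ascii_hit = re.search(pattern, text, re.ASCII) is not None
    results[text] = (unicode_hit, ascii_hit)
    print(repr(text), "unicode:", unicode_hit, "ascii:", ascii_hit)
```

Under ASCII rules "ü" and "ğ" are not word characters, so a boundary is found on both sides of "y". Under Unicode rules they are letters, so only the standalone "y" matches.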

Upvotes: 9
