xpanta

Reputation: 8418

python find-replace non-latin word in string with regex

I am trying to do this:

val = re.sub(r'\b' + u_word + r'\b', unicode(new_word), u_text)

(All strings are non-latin.)

It does not work at all.

Is it possible to find-replace non-latin words (whole words) in a non-latin text with regex? How?

EDIT:

If you want to test try these strings:

>>> u_word = u'αβ'
>>> u_text = u'αβγ αβ αβγδ δαβ'
>>> new_word = u'χχ'
>>> val = re.sub(r'\b' + u_word +r'\b', unicode(new_word), u_text)
>>> val
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>> u_text
u'\u03b1\u03b2\u03b3 \u03b1\u03b2 \u03b1\u03b2\u03b3\u03b4 \u03b4\u03b1\u03b2'
>>> 

Upvotes: 2

Views: 1357

Answers (1)

Jon-Eric

Reputation: 17275

You need to pass the re.UNICODE flag to sub, like so:

val = re.sub(r'\b' + u_word + r'\b', unicode(new_word), u_text, flags=re.UNICODE)

\b is a word boundary. Without the re.UNICODE flag, a "word" contains only characters from the set [a-zA-Z0-9_], so αβ isn't seen as a "word". For more information see the re documentation (specifically \b, \w, and re.UNICODE).
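
The effect on \w (and therefore on \b) can be seen directly. This is a minimal sketch, assuming Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag reproduces the ASCII-only behavior described above; the variable names are illustrative only:

```python
import re

text = 'αβγ αβ αβγδ δαβ'

# ASCII-only matching: no Greek letter counts as a word character,
# so \w+ finds nothing and \b never lands next to these letters.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

# Unicode-aware matching: each run of Greek letters is a word.
unicode_words = re.findall(r'\w+', text, flags=re.UNICODE)

print(ascii_words)
print(unicode_words)
```

Because \b is defined in terms of \w, a pattern such as \bαβ\b can only match when the Greek letters are treated as word characters.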

FYI:

  • If new_word is already a unicode string (as in your example), unicode(new_word) is superfluous; it returns new_word unmodified.
  • In Python 3.x, unicode is no longer a special case: str is always Unicode, and str patterns treat Unicode letters as word characters by default. Your code would have worked as is in Python 3.x (minus unicode(), which was removed because it's no longer necessary).
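
For reference, a Python 3 version of the whole-word replacement might look like the sketch below. The helper name replace_whole_word is hypothetical, and the re.escape call and the callable replacement are additions not present in the original code; they are there in case the word or the replacement contains characters that the re module would otherwise interpret specially.

```python
import re

def replace_whole_word(text, word, replacement):
    """Replace whole-word occurrences of `word` in `text` (hypothetical helper)."""
    # re.escape keeps any regex metacharacters in `word` literal.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    # A callable replacement is inserted verbatim, so backslashes in
    # `replacement` are not treated as group references or escapes.
    return pattern.sub(lambda match: replacement, text)

u_text = 'αβγ αβ αβγδ δαβ'
result = replace_whole_word(u_text, 'αβ', 'χχ')
print(result)
```

Only the standalone αβ is replaced; the occurrences inside αβγ, αβγδ and δαβ are left alone because no word boundary surrounds them on both sides.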

Upvotes: 1
