How do I get this regular expression to ignore accented characters?

Question

I have a text file that my script is reading and getting the most frequent words from. However, at one point in the process of doing that, during the clean-up of the source text, it cannot handle accented characters (in this case, they are áéíóöőúüű).

This is what I have at the moment.

str = re.sub(r'\W+', ' ', str)

This simply deletes the accented characters. I have tried adding flags=re.U, but it just messed up the result in a different way. I suspect there is a simple way to solve my problem and I have looked for it, but haven't been successful and so I turn to you. Thanks in advance.

Wiktor Stribiżew · Accepted Answer

You need the re.UNICODE flag, and both the pattern and the input must be Unicode strings (note the u prefixes):

s = re.sub(ur'\W+', u' ', s, flags=re.UNICODE)
                                   ^^^^^^^^^^

From the Python 2.x docs for re.UNICODE:

Make the \w, \W, \b, \B, \d, \D, \s and \S sequences dependent on the Unicode character properties database. Also enables non-ASCII matching for IGNORECASE.
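To see what the flag changes, here is a minimal sketch written for Python 3 (an assumption; the question targets Python 2). In Python 3, str patterns are Unicode-aware by default, and re.ASCII switches back to the ASCII-only classes that caused the original problem. The sample text is made up for illustration.

```python
import re

text = "szőlő és alma"  # illustrative sample, not from the question

# ASCII-only classes: accented letters count as non-word characters,
# so \W+ swallows them together with the spaces.
ascii_only = re.sub(r"\W+", " ", text, flags=re.ASCII)

# Unicode-aware classes (the default for str patterns in Python 3):
# accented letters are word characters and are left alone.
unicode_aware = re.sub(r"\W+", " ", text)

print(ascii_only)     # sz l s alma
print(unicode_aware)  # szőlő és alma
```

The first result is the same kind of damage described in the question: every accented letter is treated as a separator.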

Here is an online Python 2.7 demo:

import re
s = u"characters (in this case, they are áéíóöőúüű)."
res = re.sub(ur'\W+', u' ', s, flags=re.UNICODE).encode("utf8")
print(res) # => characters in this case they are áéíóöőúüű
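If you are on Python 3 instead (an assumption, since the question uses Python 2), no flag or u prefix should be needed, because str is already Unicode. Below is a rough sketch of the whole clean-up-and-count step; most_frequent_words and the sample sentence are hypothetical names and data for illustration, not part of the original script.

```python
import re
from collections import Counter


def most_frequent_words(text, n=3):
    """Return the n most common words in text as (word, count) pairs."""
    # Replace every run of non-word characters with a single space.
    # In Python 3, \W is Unicode-aware for str, so accented letters survive.
    cleaned = re.sub(r"\W+", " ", text)
    # Lowercase so that "Őszi" and "őszi" are counted as the same word.
    words = cleaned.lower().split()
    return Counter(words).most_common(n)


sample = "Őszi szél, őszi eső, őszi köd; hűvös szél."
print(most_frequent_words(sample, 2))  # [('őszi', 3), ('szél', 2)]
```

When reading the text from a file, open it with an explicit encoding (for example open(path, encoding="utf-8")) so the accented characters are decoded before the regex sees them.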
