Reputation: 285
I have a text file that my script is reading and getting the most frequent words from. However, at one point in the process of doing that, during the clean-up of the source text, it cannot handle accented characters (in this case, they are áéíóöőúüű).
This is what I have at the moment.
str = re.sub(r'\W+', ' ', str)
This simply deletes the accented characters. I have tried adding flags=re.U, but it just messed up the result in a different way. I suspect there is a simple way to solve my problem and I have looked for it, but haven't been successful and so I turn to you. Thanks in advance.
Upvotes: 3
Views: 1647
Reputation: 626870
You need the re.UNICODE flag, and both the pattern and the input must be unicode strings (note the u prefixes):
str = re.sub(ur'\W+', u' ', s, flags=re.UNICODE)
                                     ^^^^^^^^^^
See Python 2.x docs:
Make the \w, \W, \b, \B, \d, \D, \s and \S sequences dependent on the Unicode character properties database. Also enables non-ASCII matching for IGNORECASE.
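For readers on Python 3, the same distinction can be sketched roughly as follows. This assumes Python 3, where str patterns are Unicode-aware by default; re.ASCII is used here only to imitate what Python 2 does without re.UNICODE, and the sample text is made up for illustration.

```python
import re

# Illustrative sample, not the asker's actual file contents.
text = "they are áéíóöőúüű"

# Python 3 str patterns are Unicode-aware by default,
# so \W does not match the accented letters.
unicode_aware = re.sub(r"\W+", " ", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], which imitates
# Python 2 matching without re.UNICODE: the accents are lost.
ascii_only = re.sub(r"\W+", " ", text, flags=re.ASCII)

print(repr(unicode_aware))
print(repr(ascii_only))
```

In the ASCII-only case every accented letter counts as a non-word character, so the whole run collapses into a single space.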
Here is an online Python 2.7 demo:
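# -*- coding: utf-8 -*-
import re
# Unicode input; the u prefix matters in Python 2
s = u"characters (in this case, they are áéíóöőúüű)."
# Unicode pattern plus re.UNICODE keeps accented letters out of \W
res = re.sub(ur'\W+', u' ', s, flags=re.UNICODE).encode("utf8")
print(res) # => characters in this case they are áéíóöőúüű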
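Since the end goal in the question is a list of the most frequent words, here is a minimal sketch of how the clean-up step might feed into a count. It assumes Python 3 rather than the 2.7 of the demo above; most_frequent_words and the sample sentence are hypothetical names and data, not taken from the asker's script.

```python
import re
from collections import Counter

def most_frequent_words(text, n=3):
    """Hypothetical helper: return the n most common words in text."""
    # Replace runs of non-word characters with one space. In Python 3,
    # \W is Unicode-aware for str patterns, so accented letters survive.
    cleaned = re.sub(r"\W+", " ", text)
    # casefold() groups "Az" with "az"; split() drops empty edge tokens.
    return Counter(cleaned.casefold().split()).most_common(n)

# Made-up Hungarian sample for illustration only.
sample = "Az alma piros. Az alma édes, a körte is édes!"
print(most_frequent_words(sample, 3))
```

When reading from a file, opening it with an explicit encoding (for example open(path, encoding="utf-8")) gives str input, so no separate decode step is needed.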
Upvotes: 3