Reputation: 6204
Consider this snippet using regular expressions in Python 3:
>>> import re
>>> t = "Meu cão é #paraplégico$."
>>> re.sub("[^A-Za-z0-9 ]","",t,flags=re.UNICODE)
'Meu co  paraplgico'
Why does it delete non-ASCII characters? I tried without the flag and it's all the same.
As a bonus, can anyone make this work on Python 2.7 as well?
Upvotes: 2
Views: 5249
Reputation: 3623
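You are replacing every character that is not an ASCII letter, digit, or space ([^A-Za-z0-9 ]) with the empty string (""). The non-ASCII characters are not in the ranges A-Z, a-z, or 0-9, so they are removed as well. The re.UNICODE flag makes no difference here: it only affects shorthand classes such as \w, and in Python 3 it is already the default for str patterns.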
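To make the role of the flag concrete, here is a small Python 3 sketch comparing the explicit ASCII ranges with \w, with and without flags. The helper name compare is only for illustration and is not part of the re module.

```python
import re

def compare(text):
    # Hypothetical helper: run the same clean-up with four pattern/flag combinations.
    return {
        # Explicit ranges only ever match ASCII, whatever the flag says.
        "ascii_ranges": re.sub(r"[^A-Za-z0-9 ]", "", text),
        "ascii_ranges_unicode_flag": re.sub(r"[^A-Za-z0-9 ]", "", text, flags=re.UNICODE),
        # \w is Unicode-aware by default for str patterns in Python 3.
        "word_default": re.sub(r"[^\w ]", "", text),
        # re.ASCII restricts \w to [a-zA-Z0-9_].
        "word_ascii_flag": re.sub(r"[^\w ]", "", text, flags=re.ASCII),
    }

for name, result in compare("Meu cão é #paraplégico$.").items():
    print(name, repr(result))
```

Only the \w variant without re.ASCII keeps the accented letters; the two range-based calls return identical strings.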
You can match all word characters like this:
>>> t = "Meu cão é #paraplégico$."
>>> re.sub(r"[^\w ]","",t, flags=re.UNICODE)
'Meu cão é paraplégico'
Or you could add the characters to your regex explicitly: [^A-Za-z0-9ãé ].
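For the Python 2.7 part of the question, something along these lines should run unchanged on 2.7 and 3.3+, assuming the input is a unicode string. The helper name strip_symbols is made up for this sketch. Note that \w also matches the underscore, so the pattern removes it explicitly.

```python
# -*- coding: utf-8 -*-
import re

# u"..." literals are valid on both Python 2.7 and 3.3+; the backslash is
# doubled because the ur"" prefix does not exist on Python 3.
# re.UNICODE is required on Python 2 and harmless on Python 3.
NOT_WORD_OR_SPACE = re.compile(u"[^\\w ]|_", re.UNICODE)

def strip_symbols(text):
    # Hypothetical helper: drop everything except letters, digits and spaces.
    return NOT_WORD_OR_SPACE.sub(u"", text)

print(strip_symbols(u"Meu cão é #paraplégico$."))
```

On Python 2 a plain "..." literal is a byte string, and \w will not treat the bytes of an encoded accented letter as word characters, so decode the input to unicode first.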
Upvotes: 4
Reputation: 6204
I solved this by switching to the regex library (from PyPI).
The call then became (Python 2 syntax; the ur prefix is a syntax error on Python 3, where r"..." is enough):
regex.sub(ur"[^\p{L}\p{N} ]+", u"", t)
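If installing a third-party package is not an option, a rough standard-library approximation of [^\p{L}\p{N} ] is to filter on Unicode general categories. This is a different technique from the regex call above, not a drop-in replacement, and keep_letters_digits_spaces is a name invented for this sketch (Python 3).

```python
import unicodedata

def keep_letters_digits_spaces(text):
    # Compose "e" + combining accent into a single letter first, so that
    # decomposed input does not lose its accents (combining marks are category M).
    text = unicodedata.normalize("NFC", text)
    # Keep plain spaces plus any character whose general category starts
    # with L (letter) or N (number).
    return "".join(
        ch for ch in text
        if ch == " " or unicodedata.category(ch)[0] in ("L", "N")
    )

print(keep_letters_digits_spaces("Meu cão é #paraplégico$."))
```

Unlike \w, this drops the underscore, since its category is Pc (connector punctuation).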
Upvotes: 0
Reputation: 6203
In [1]: import regex
In [2]: t = u"Meu cão é #paraplégico$."
In [3]: regex.sub(r"[^\p{Alpha} ]","",t,flags=regex.UNICODE)
In [4]: print(regex.sub(r"[^\p{Alpha} ]","",t,flags=regex.UNICODE))
Meu cão é paraplégico
Upvotes: 3