fccoelho

Reputation: 6204

Regexp for non-ASCII characters

Consider this snippet using regular expressions in Python 3:

>>> import re
>>> t = "Meu cão é #paraplégico$."
>>> re.sub("[^A-Za-z0-9 ]", "", t, flags=re.UNICODE)
'Meu co  paraplgico'

Why does it delete non-ASCII characters? I tried without the flag and it's all the same.

As a bonus, can anyone make this work on Python 2.7 as well?

Upvotes: 2

Views: 5249

Answers (3)

Yeonho

Reputation: 3623

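You are replacing every character that is not an ASCII letter, digit, or space ([^A-Za-z0-9 ]) with the empty string (""). Non-ASCII characters such as ã and é are not in A-Z, a-z, or 0-9, so they are removed too. The re.UNICODE flag makes no difference here because it only changes the meaning of shorthand classes such as \w, not of explicit ranges like A-Z.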

You can match all word characters like this:

>>> import re
>>> t = "Meu cão é #paraplégico$."
>>> re.sub(r"[^\w ]", "", t, flags=re.UNICODE)
'Meu cão é paraplégico'

Or you could add the characters into your regex like so: [^A-Za-z0-9ãé ].
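To make the difference concrete, here is a minimal Python 3 sketch comparing the explicit ASCII ranges with \w. The variable names and the underscore example are illustrative additions, not part of the original answer. On Python 2.7 the same idea should apply, assuming both the text and the pattern are unicode strings and flags=re.UNICODE is passed.

```python
import re

text = "Meu cão é #paraplégico$."

# Explicit ASCII ranges: "ã" and "é" are outside A-Z/a-z, so they are removed
# along with the punctuation. re.UNICODE does not change what A-Z means.
ascii_only = re.sub(r"[^A-Za-z0-9 ]", "", text)

# \w covers Unicode letters and digits for str patterns in Python 3,
# so the accented letters survive. Note that \w also matches "_".
unicode_aware = re.sub(r"[^\w ]", "", text)

# Illustrative variant: also drop underscores, which \w would otherwise keep.
no_underscore = re.sub(r"[^\w ]|_", "", "cão_é #1!")

print(ascii_only)
print(unicode_aware)
print(no_underscore)
```

The raw-string prefix (r"...") keeps Python from interpreting \w as a string escape before the pattern reaches the re module.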

Upvotes: 4

fccoelho

Reputation: 6204

I solved this by switching to the regex library (from PyPI).

The call then became (Python 2 syntax; Python 3 does not accept the ur prefix, so write the pattern as r"..." there):

regex.sub(ur"[^\p{L}\p{N} ]+", u"", t)
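If installing a third-party package is not an option, a similar filter can be sketched with the standard library alone. This is a swapped-in technique, not the regex-based approach above: it uses unicodedata.category to keep characters whose general category starts with L (letter) or N (number), which roughly mirrors \p{L} and \p{N}. The function name is hypothetical, and the NFC normalization step is an added assumption so that accents stored as separate combining marks are not dropped.

```python
import unicodedata


def keep_letters_digits_spaces(text):
    """Keep letters (L*), numbers (N*), and plain spaces; drop everything else."""
    # Assumption: compose accents first so "a" + combining tilde becomes "ã".
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if ch == " " or unicodedata.category(ch)[0] in ("L", "N")
    )


cleaned = keep_letters_digits_spaces(u"Meu cão é #paraplégico$.")
print(cleaned)
```

Unlike the regex version, this walks the string one character at a time, so it is likely slower on large inputs, but it should run unchanged on both Python 2.7 (with unicode input) and Python 3.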

Upvotes: 0

dda

Reputation: 6203

[In 1]: import regex
[In 2]: t = u"Meu cão é #paraplégico$."
[In 3]: regex.sub(r"[^\p{Alpha} ]","",t,flags=regex.UNICODE)
[In 4]: print(regex.sub(r"[^\p{Alpha} ]","",t,flags=regex.UNICODE))

Meu cão é paraplégico

Upvotes: 3
