Reputation: 23
I am trying to remove characters from a Unicode string. I have a whitelist of allowed Unicode characters and I would like to remove everything that is not on the list.
allowed_list = ur'[\u0041-\u005A]|[\u0061-\u007A]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u012F]|\u0131|[\u0386]|[\u0388-\u038A]'
negated_list = ur'[^\u0041-\u005A]|[^\u0061-\u007A]|[^\u00C0-\u00D6]|[^\u00D8-\u00F6]|[^\u00F8-\u012F]|^\u0131|[^\u0386]|[^\u0388-\u038A]'
I am testing it with a subset of my list and I don't get why it is not working.
This removes all but lowercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'rugg'
This removes all but uppercase latin chars:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
'AT'
But when I combine them, all chars get removed:
>>> mystr = 'Arugg^]T'
>>> myre = re.compile(ur'[^\u0041-\u005A]|[^\u0061-\u007A]', re.UNICODE)
>>> result = myre.sub('', mystr)
>>> result
''
When I tested the regex [^\u0041-\u005A]|[^\u0061-\u007A]
on https://pythex.org/ it did what I expected, but when I attempt to use it in my code, it does not do what I want. What am I missing?
Thank you in advance!
Upvotes: 1
Views: 166
Reputation:
In a positive character class, the items are implicitly OR'd together.
Your regex is then the same as
[\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
But in a negated character class [^...]
, the items are individually negated and then AND'ed together.
The negated version of your list is then
[^\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]
which is logically the same as
[^\u0041-\u005A]
and [^\u0061-\u007A]
and [^\u00C0-\u00D6]
and [^\u00D8-\u00F6]
and [^\u00F8-\u012F]
and [^\u0131]
and [^\u0386]
and [^\u0388-\u038A]
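What you tried to do was negate each item and then OR them together,
which is not the same: no character belongs to every range at once, so every character matches at least one of the negated classes and gets removed.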
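The difference can be checked directly. The following is a minimal sketch, assuming Python 3 (where the ur prefix from the question no longer exists and a plain raw string is enough), restricted to the two Latin ranges the question tests with. The variable names are made up for illustration.

```python
import re

text = "Arugg^]T"

# Alternation of two negated classes: every character lies outside at
# least one of the two ranges, so every character matches and is removed.
either_negated = re.compile(r"[^\u0041-\u005A]|[^\u0061-\u007A]")

# One negated class holding both ranges: only characters outside
# both ranges match.
both_negated = re.compile(r"[^\u0041-\u005A\u0061-\u007A]")

removed_by_alternation = either_negated.sub("", text)
removed_by_single_class = both_negated.sub("", text)

print(repr(removed_by_alternation))   # ''
print(repr(removed_by_single_class))  # 'AruggT'
```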
Upvotes: 0
Reputation: 18753
Your regex is not correct: you are using |
, which matches if either alternative matches.
You need a single class that contains both ranges:
[^\u0041-\u005A\u0061-\u007A]
matches any character that is outside both \u0041-\u005A
and \u0061-\u007A
.
import re
regex = r"[^\u0041-\u005A\u0061-\u007A]"
test_str = "Arugg^]T"
myre = re.compile(regex, re.UNICODE)
result = myre.sub('', test_str)
print(result)
# output:
# AruggT
Upvotes: 1
Reputation: 51683
You are replacing every character that matches
[^\u0041-\u005A]
or [^\u0061-\u007A]
(each class is negated by its own ^
).
A character is replaced by '' if either alternative matches, and every character lies outside at least one of the two ranges, so the pattern always matches no matter what you have.
Use ur'[^\u0041-\u005A\u0061-\u007A]'
instead (both ranges inside one [...]).
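Extending the same idea to the full whitelist from the question could look like the sketch below. It assumes Python 3 and simply copies the ranges from allowed_list into one negated class; NOT_ALLOWED and keep_allowed are hypothetical names, not something from the question.

```python
import re

# All ranges from the question's allowed_list, merged into a single
# negated character class (the pattern is split across two raw strings
# only for line length).
NOT_ALLOWED = re.compile(
    r"[^\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6"
    r"\u00F8-\u012F\u0131\u0386\u0388-\u038A]"
)


def keep_allowed(text):
    """Return text with every character outside the whitelist removed."""
    return NOT_ALLOWED.sub("", text)


print(keep_allowed("Arugg^]T"))  # AruggT
```

Anything that falls outside every listed range, such as digits, spaces, or punctuation, is dropped, while accented Latin letters inside the listed ranges are kept.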
Upvotes: 0