Reputation: 23

How to substitute chars using unicode regex range

I am trying to remove chars from a Unicode string. I have a whitelist of allowed Unicode chars and I would like to remove everything that is not on the list.

    allowed_list = ur'[\u0041-\u005A]|[\u0061-\u007A]|[\u00C0-\u00D6]|[\u00D8-\u00F6]|[\u00F8-\u012F]|\u0131|[\u0386]|[\u0388-\u038A]'
    negated_list = ur'[^\u0041-\u005A]|[^\u0061-\u007A]|[^\u00C0-\u00D6]|[^\u00D8-\u00F6]|[^\u00F8-\u012F]|^\u0131|[^\u0386]|[^\u0388-\u038A]'

I am testing it with a subset of my list and I don't get why it is not working.

This removes all but lowercase Latin chars:

    >>> mystr = 'Arugg^]T'
    >>> myre = re.compile(ur'[^\u0061-\u007A]', re.UNICODE)
    >>> result = myre.sub('', mystr)
    >>> result
    'rugg'

This removes all but uppercase Latin chars:

    >>> mystr = 'Arugg^]T'
    >>> myre = re.compile(ur'[^\u0041-\u005A]', re.UNICODE)
    >>> result = myre.sub('', mystr)
    >>> result
    'AT'

But when I combine them, all chars get removed:

    >>> mystr = 'Arugg^]T'
    >>> myre = re.compile(ur'[^\u0041-\u005A]|[^\u0061-\u007A]', re.UNICODE)
    >>> result = myre.sub('', mystr)
    >>> result
    ''

When I tested the regex [^\u0041-\u005A]|[^\u0061-\u007A] on https://pythex.org/, it did what I expected, but when I attempt to use it in my code, it does not do what I want. What am I missing?

Thank you in advance!

Upvotes: 1

Answers (3)

user557597

Reputation:

Inside a positive character class, the items are implicitly OR'd together.

Your allowed list is therefore the same as the single class

    [\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]

In a negated class [^...], each item is negated individually and the results are AND'ed together.

The negated version of that class is

    [^\u0041-\u005a\u0061-\u007a\u00c0-\u00d6\u00d8-\u00f6\u00f8-\u012f\u0131\u0386\u0388-\u038a]

which is logically the same as

    [^\u0041-\u005A] and [^\u0061-\u007A] and [^\u00C0-\u00D6] and [^\u00D8-\u00F6] and [^\u00F8-\u012F] and [^\u0131] and [^\u0386] and [^\u0388-\u038A]

What you tried to do was negate each item and then OR the results together, which is not the same thing.
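The AND reading of a negated class can be checked with a small sketch. It assumes Python 3 (where the ur prefix from the question is not valid, so plain raw strings are used) and uses only the two Latin ranges from the question; the names combined and spelled_out are made up for illustration. The second pattern writes the AND out by hand with a lookahead, and both should strip the same characters.

```python
import re

text = "Arugg^]T"

# One negated class: a character matches only if it is outside BOTH ranges.
combined = re.compile(r"[^\u0041-\u005A\u0061-\u007A]")

# The same AND spelled out: the lookahead requires "not A-Z",
# then the class itself requires "not a-z".
spelled_out = re.compile(r"(?=[^\u0041-\u005A])[^\u0061-\u007A]")

print(combined.sub("", text))     # AruggT
print(spelled_out.sub("", text))  # AruggT
```

Only ^ and ] are outside both ranges, so they are the only characters removed.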

Upvotes: 0

Sufiyan Ghori

Reputation: 18753

Your regex is not correct: you are using |, which matches if either alternative is true.

You need to create one expression with multiple ranges.

[^\u0041-\u005A\u0061-\u007A] will match any character except those in the range \u0041-\u005A or \u0061-\u007A.

    import re

    regex = r"[^\u0041-\u005A\u0061-\u007A]"
    test_str = "Arugg^]T"

    myre = re.compile(regex, re.UNICODE)
    result = myre.sub('', test_str)
    print(result)

    # Output:
    # AruggT
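The same idea should extend to the full whitelist from the question. The following is a minimal sketch, assuming Python 3; the names ALLOWED, NOT_ALLOWED and strip_disallowed are hypothetical helpers, and the sample strings are made up for illustration.

```python
import re

# Ranges copied from the question's whitelist, all inside ONE class.
ALLOWED = (
    r"\u0041-\u005A"  # A-Z
    r"\u0061-\u007A"  # a-z
    r"\u00C0-\u00D6"
    r"\u00D8-\u00F6"
    r"\u00F8-\u012F"
    r"\u0131"
    r"\u0386"
    r"\u0388-\u038A"
)

# A single negated class: matches any character outside every allowed range.
NOT_ALLOWED = re.compile("[^" + ALLOWED + "]")


def strip_disallowed(text):
    """Remove every character that is not on the whitelist."""
    return NOT_ALLOWED.sub("", text)


print(strip_disallowed("Arugg^]T"))             # AruggT
print(strip_disallowed("\u03a9mega caf\u00e9 \u00d72"))  # megacafé
```

In the second call, Ω (U+03A9), the spaces, × (U+00D7) and the digit fall outside every range, while é (U+00E9) lies in \u00D8-\u00F6 and is kept.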

Upvotes: 1

Patrick Artner

Reputation: 51683

You are replacing every character that is not in \u0041-\u005A or not in \u0061-\u007A (because of the ^ in each class).

No character can be in both ranges at once, so at least one of the two alternatives always matches and every character gets replaced by ''.

Use ur'[^\u0041-\u005A\u0061-\u007A]' instead (both ranges inside one [...]).
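To see why the alternation always matches, each character of the test string can be checked against the two negated classes separately. This is a rough sketch, assuming Python 3 (plain raw strings instead of the ur prefix); the variable names are made up for illustration.

```python
import re

text = "Arugg^]T"
not_upper = re.compile(r"[^\u0041-\u005A]")
not_lower = re.compile(r"[^\u0061-\u007A]")

checks = []
for ch in text:
    outside_upper = bool(not_upper.fullmatch(ch))
    outside_lower = bool(not_lower.fullmatch(ch))
    # No character is in both ranges, so at least one flag is always True.
    checks.append(outside_upper or outside_lower)
    print(repr(ch), outside_upper, outside_lower)

# The alternation from the question therefore matches every character.
either = re.compile(r"[^\u0041-\u005A]|[^\u0061-\u007A]")
result = either.sub("", text)
print(repr(result))  # ''
```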

Upvotes: 0
