lukik
lukik

Reputation: 4060

How to replace all characters except letters, numbers, forward and back slashes

Want to parse through text and return only letters, digits, forward and back slashes and replace all else with ''.

Is it possible to use just one regex pattern as opposed to several which then calls for looping? Am unable to get the pattern below not to replace the back and forward slash.

line1 = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E  xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
line2 = line
RGX_PATTERN = "[^\w]", "_"

for pattern in RGX_PATTERN:
    line = re.sub(r"%s" %pattern, '', line)
print("replace1: " + line)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

The code below from SO had been tested and found to be faster than regex but then it replaces all special characters including the / and \ that I want to preserve. Is there any way to edit it to work for my use case and still maintain its edge over regex?

line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

As an extra hurdle, the text am parsing should be in ASCII form so if possible characters from any other encoding should also be replaced by ''

Upvotes: 2

Views: 9274

Answers (2)

Veedrac
Veedrac

Reputation: 60127

A fair bit faster and works for Unicode:

full_pattern = re.compile('[^a-zA-Z0-9\\\/]|_')

def re_replace(string):
    return re.sub(full_pattern, '', string)

If you want it really fast, this is by far the best (but slightly obscure) method:

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')

    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)

Timings:

SETUP="
busy = ''.join(chr(i) for i in range(512))

import re
full_pattern = re.compile('[^a-zA-Z0-9\\\/]|_')

def in_whitelist(character):
    return character.isalnum() or character in '\\/'

def re_replace(string):
    return re.sub(full_pattern, '', string)

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
"

python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"

Results:

10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop

Upvotes: 9

Master_Yoda
Master_Yoda

Reputation: 1132

Why can't you do something like:

def in_whitelist(character):
    return character.isalnum() or character in ['\\','/']

line2 = ''.join(e for e in line2 if in_whitelist(e))

Edited as per suggestion to condense function.

Upvotes: 4

Related Questions