How to replace all characters except letters, numbers, forward and back slashes

Question

I want to parse text and keep only letters, digits, forward slashes and backslashes, replacing everything else with ''.

Is it possible to do this with a single regex pattern rather than several patterns applied in a loop? I cannot get the pattern below to leave the backslash and forward slash alone.

line1 = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E  xce|pt Forw:ard" $An>d B,?a..ck Sl'as



The code below, from SO, was tested and found to be faster than regex, but it strips all special characters, including the / and \ that I want to preserve. Can it be edited to work for my use case while keeping its edge over regex?

line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2


As an extra hurdle, the text I am parsing should be ASCII only, so if possible any non-ASCII characters should also be replaced by ''.

Veedrac · Accepted Answer

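A fair bit faster, and it handles Unicode input (non-ASCII characters are stripped along with the rest):

import re

# Match anything that is not an ASCII letter, digit, backslash or forward slash.
# The raw string keeps \\ as an escaped backslash for the regex engine; a bare \/
# would only escape the forward slash, so backslashes would be removed too.
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def re_replace(string):
    return full_pattern.sub('', string)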
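
As a quick check of the single-pattern approach, here is a minimal sketch. It assumes the pattern is written as a raw string so that the backslash is escaped for the regex engine; the names `keep_pattern` and `strip_unwanted` and the sample input are made up for illustration.

```python
import re

# Negated character class: anything that is not an ASCII letter, digit,
# backslash or forward slash is matched and removed.
keep_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def strip_unwanted(text):
    return keep_pattern.sub('', text)

sample = 'a/b\\c_d é!1'        # hypothetical input containing one backslash
print(strip_unwanted(sample))  # a/b\cd1
```

The underscore needs no separate alternative here, because it is already outside the a-zA-Z0-9 ranges and so falls inside the negated class.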

If you want it really fast, this is by far the best (but slightly obscure) method:

def wanted(character):
    # '\\/' is a two-character string: one backslash and one forward slash.
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')

    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)
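
The reason this works is that str.translate looks each character up by its code point in the table: an entry of None deletes the character, and a missing entry leaves it unchanged. The sketch below shows the same idea with a dict table instead of a list; `KEEP`, `delete_table` and `strip_ascii_unwanted` are hypothetical names, not part of the answer above.

```python
import string

# Characters to keep: ASCII letters, digits, backslash and forward slash.
KEEP = set(string.ascii_letters + string.digits + '\\/')

# Map every unwanted ASCII code point to None (delete). Kept characters are
# simply absent from the table, so translate passes them through unchanged.
delete_table = {code: None for code in range(128) if chr(code) not in KEEP}

def strip_ascii_unwanted(text):
    # Drop non-ASCII characters first, then delete unwanted ASCII ones.
    ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
    return ascii_only.translate(delete_table)

print(strip_ascii_unwanted('1/R~e`p!l@@a#c$e \\ ü2'))  # 1/Replace\2
```

The table is built once at import time, so each call only pays for the encode/decode round trip and one translate pass.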

Timings:

SETUP="
busy = ''.join(chr(i) for i in range(512))

import re
# Backslashes are doubled below because the shell halves them inside double quotes.
full_pattern = re.compile(r'[^a-zA-Z0-9\\\\/]')

def in_whitelist(character):
    return character.isalnum() or character in '\\\\/'

def re_replace(string):
    return re.sub(full_pattern, '', string)

def wanted(character):
    return character.isalnum() or character in '\\\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
"

python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"

Results:

10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop
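
For anyone who would rather run the comparison from inside Python than from the shell, the following is a rough sketch using the standard timeit module. The function name `join_replace`, the `timings` dict and the loop count are assumptions made for illustration, and the numbers it prints will differ from the results above depending on machine and Python version.

```python
import re
import timeit

busy = ''.join(chr(i) for i in range(512))

pattern = re.compile(r'[^a-zA-Z0-9\\/]')
table = [c if (c.isalnum() or c in '\\/') else None for c in map(chr, range(128))]

def re_replace(s):
    return pattern.sub('', s)

def join_replace(s):
    # Note: isalnum() also accepts non-ASCII letters and digits.
    return ''.join(e for e in s if e.isalnum() or e in '\\/')

def fast_replace(s):
    return s.encode('ascii', errors='ignore').decode('ascii').translate(table)

LOOPS = 1000
timings = {}
for fn in (re_replace, join_replace, fast_replace):
    # Best total time over 3 repeats of LOOPS calls, in seconds.
    best = min(timeit.repeat(lambda: fn(busy), number=LOOPS, repeat=3))
    timings[fn.__name__] = best / LOOPS * 1e6  # microseconds per call
    print(f'{fn.__name__}: {timings[fn.__name__]:.2f} usec per loop')
```

Keep in mind that the join variant is not strictly equivalent on this input, since it keeps non-ASCII alphanumerics that the other two remove.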
