Reputation: 4060
I want to parse text and keep only letters, digits, forward slashes and backslashes, replacing everything else with ''.
Is it possible to do this with a single regex pattern, rather than several patterns that then require looping? I am unable to get the pattern below to leave the backslash and forward slash in place.
import re
line = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
line2 = line
RGX_PATTERN = r"[^\w]", "_"
for pattern in RGX_PATTERN:
    line = re.sub(pattern, '', line)
print("replace1: " + line)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2
The code below from SO had been tested and found to be faster than regex but then it replaces all special characters including the / and \ that I want to preserve. Is there any way to edit it to work for my use case and still maintain its edge over regex?
line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2
As an extra hurdle, the text I am parsing should be ASCII, so if possible any non-ASCII characters should also be replaced by ''.
Upvotes: 2
Views: 9274
Reputation: 60127
A fair bit faster and works for Unicode:
import re
# Raw string: \\ is a literal backslash for the regex engine, / needs no escaping
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]|_')
def re_replace(string):
    return re.sub(full_pattern, '', string)
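As a quick sanity check, here is a minimal sketch that runs a single pattern against the sample line from the question. The names `PATTERN` and `keep_alnum_and_slashes` are my own, not from the original code, and I have left out the `|_` alternative because the negated class already matches the underscore.

```python
import re

# Hypothetical names for illustration; one negated character class does all the work.
PATTERN = re.compile(r'[^a-zA-Z0-9\\/]')

def keep_alnum_and_slashes(text):
    # Remove everything except ASCII letters, digits, '/' and the backslash.
    return PATTERN.sub('', text)

sample = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
result = keep_alnum_and_slashes(sample)
print(result)  # 1/ReplaceAllSpecialCharactersExceptForwardAndBackSlashes\2
```

Because the class lists only ASCII ranges, non-ASCII letters such as 'é' are removed as well, which covers the ASCII requirement in the question.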
If you want it really fast, this is by far the best (but slightly obscure) method:
def wanted(character):
    return character.isalnum() or character in '\\/'
ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]
def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')
    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)
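To make the obscure part less obscure, here is a small sketch of how the table behaves. `str.translate` looks each character up by code point, so a list works as a table: index is the code point, `None` means delete. The variable names and the sample text below are mine, chosen only for illustration.

```python
def wanted(character):
    return character.isalnum() or character in '\\/'

# List index == code point; None means "delete this character".
table = [chr(i) if wanted(chr(i)) else None for i in range(128)]

text = "a/b\\c é!"

# Code points >= 128 fall outside the list; translate() treats the failed
# lookup as "leave the character unchanged", so 'é' survives here.
without_ascii_step = text.translate(table)

# Dropping non-ASCII characters first means every lookup hits the table.
with_ascii_step = text.encode('ascii', errors='ignore').decode('ascii').translate(table)

print(without_ascii_step)  # a/b\cé
print(with_ascii_step)     # a/b\c
```

This is why the encode/decode step comes first: the 128-entry table alone does not remove non-ASCII characters.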
Timings:
SETUP="
# Backslashes are doubled below because the shell double quotes consume one level of escaping.
busy = ''.join(chr(i) for i in range(512))
import re
full_pattern = re.compile(r'[^a-zA-Z0-9\\\\/]|_')
def in_whitelist(character):
    return character.isalnum() or character in '\\\\/'
def re_replace(string):
    return re.sub(full_pattern, '', string)
def wanted(character):
    return character.isalnum() or character in '\\\\/'
ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]
def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
"
python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"
Results:
10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop
Upvotes: 9
Reputation: 1132
Why can't you do something like:
def in_whitelist(character):
    return character.isalnum() or character in ['\\','/']
line2 = ''.join(e for e in line2 if in_whitelist(e))
Edited as per suggestion to condense function.
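One caveat with regard to the ASCII requirement in the question: `str.isalnum()` is also true for non-ASCII letters and digits, so the whitelist above would keep characters like 'é'. A possible variant, sketched here with hypothetical names (`in_ascii_whitelist`, `filter_line`), adds a code point check:

```python
def in_ascii_whitelist(character):
    # ord(...) < 128 restricts the test to ASCII, since 'é'.isalnum() is True.
    return ord(character) < 128 and (character.isalnum() or character in '\\/')

def filter_line(line):
    # Keep only ASCII letters, digits, '/' and the backslash.
    return ''.join(c for c in line if in_ascii_whitelist(c))

print('é'.isalnum())                # True
print(filter_line("é1/a\\b_ü!"))    # 1/a\b
```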
Upvotes: 4