Reputation: 340
Please let me preface this by saying this is not a duplicate of replacing letters in a string. This is whole substring replacement.
I have a set of documents which need to have several different substrings to be replaced with either empty strings or other values. What's the fastest way to do this? Is there a way that is comparably faster than using regex?
There is a significant difference when you perform string substitution in a word by word / character by character, however this will not work in this case unless some form of string matching is being used.
Here are my previous attempts at this.
import re
def standard_for_loop(string, replacements):
# case sensitive for loop
for key, value in replacements.items():
# would not work unless case specific
string = string.replace(key, value)
return string
def regex_loop(string, replacements):
#case insensitive regex substitution in for
for key, value in replacements.items():
string = re.sub(key, value, string, re.IGNORECASE)
return string
def regex_multiple(string, replacements):
# case insensitive regex substitution using lambda
pattern = re.compile("({})".format("|".join(replacements.keys())), re.IGNORECASE)
return pattern.sub(lambda m: replacements[m.string.lower()[m.start():m.end()]], string)
def case_insensitive_for_loop(string, replacements):
def find_next(string, pattern, sub):
if pattern.lower() in string.lower():
match = string.lower().index(pattern.lower())
end = int(match + len(pattern))
new_string = string[end:]
# yield a replaced substring of original string
yield string[:match] + sub
yield from find_next(new_string, pattern, sub)
'''
# this is what I'm unsure about. How to negate need for
# for loop here and how to fix the append issue.
# Currently the functionality works but it appends output
# replacement to the result. I know the "+=" is the
# cause of the problem, but I'm not sure how to fix this.
'''
result = ''
for k, v in replacements.items():
for output in find_next(string, k, v):
result += output
return result
There are two problems, from my experience, the regex_multiple
is the most accurate but takes quite along time to complete. The next most accurate is the case_sensitive_for_loop
but I don't know how to overcome the replacement vs. appending issue.
For example, it will replace the doc:
# for a sample document
doc = """Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.
Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.
Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, venenatis ante. Proin convallis ex eu semper posuere. Etiam sed tincidunt massa. Vivamus aliquam mollis massa, nec lacinia est dictum vitae. In varius convallis pulvinar. Pellentesque aliquet pulvinar nibh vel dictum"""
#replacement strings where k is the substring to be searched and v is the value to be replaced with
repl = { 'venenatis ante':'',
'ipsum nec volutpat dapibus' : '',
'ipsum vulputate accumsan' : '',
'dolor sit amet':'',
'vivamus aliquam mollis massa':''
}
with :
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.
Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.
Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, Lorem ipsum Lorem ipsum dolor sit amet, consectetur adipiscing elit. Phasellus accumsan pulvinar massa ut pulvinar. Cras blandit quam non dictum tempus. Maecenas id posuere nibh. Nullam sit amet pharetra neque. Etiam nec imperdiet tellus. Nulla facilisi. Proin sit amet massa aliquam, pulvinar justo in, suscipit purus. Fusce in tempus orci. In consectetur, ipsum nec volutpat dapibus, felis magna scelerisque enim, et rutrum nunc ligula eget augue. Phasellus aliquam feugiat venenatis. Sed lobortis pharetra ipsum ut venenatis.
Nullam ut accumsan orci. Vivamus faucibus augue in facilisis facilisis. Donec ut scelerisque ipsum. Ut mollis elit nibh, ut vulputate eros ultrices ac. Nunc ac urna sed libero imperdiet maximus non sed dui. Morbi ornare eu eros eget pharetra. Vivamus vestibulum nisi eu eros pulvinar aliquet.
Maecenas at justo bibendum, viverra urna nec, pellentesque orci. Cras ut molestie sem. Proin in tincidunt ex. Aliquam euismod id ligula a bibendum. Morbi at diam euismod, auctor ex non, venenatis ante. Proin convallis ex eu semper posuere. Etiam sed tincidunt massa.
After having compared them, the standard_for_loop
is fastest coming at 4u secs per loop in 50k loops. The second fastest is the regex_loop
at 14 u secs per loop in 20k loop. Then followed by the case_sensitive_for_loop
which took 28.3u seconds per loop in 10k loops. The lambda expression in regex_multiple
surprisingly takes the longest time to complete at 103u seconds in 2k loops.
Here are the python timeit outputs for each function.
Wondering if there are any string matching algorithms that I have negated to look at for this. Any advice welcome
Upvotes: 1
Views: 139
Reputation: 50488
regex_multiple
is inefficient because if recompute the whole string every time there is a match. You can just lower the matched string. Here is how:
def regex_multiple(string, replacements):
# case insensitive regex substitution using lambda
pattern = re.compile("({})".format("|".join(replacements.keys())), re.IGNORECASE)
return pattern.sub(lambda m: replacements[m[0].lower()], string)
This solution should be fast compared to the others case-insensitive implementations and much faster than the original one on big documents.
Note however that you are comparing case-sensitive and case-insensitive methods. Working with case-insensitive replacement is much more computationally intensive and thus slower. To be fair, you should compare methods doing the same thing.
Finally, If you only work on ASCII documents. You can add the flag re.ASCII
to the regexp. This makes the parsing a bit faster.
Upvotes: 2