Reputation: 322
I am working with Python and I want to match a given string with multiple substrings. I have tried to solve this problem in two different ways. My first solution was to match the substring with the string like:
str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if x.lower() in str.lower()])
print(temp)
which results in temp = ["TEST", "MATCH", "MULTIPLE", "RING"]
.
However, this is not the result I would like. The substrings should have an exact match, so "ring" should not match with "string".
This is why I tried to solve this problem with regular expressions, like this:
str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([
x.upper() for x in value
if regex.search(
r"\b" + regex.escape(x) + r"\b", str, regex.IGNORECASE
) is not None
])
print(temp)
which results in ["TEST", "MATCH", "MULTIPLE"]
, the correct solution.
Be that as it may, this solution takes too long to compute. I have to do this check for roughly 1 million strings and the solution using regex will take days to finish compared to the 1.5 hours it takes using the first solution.
Is there a way to either make the first solution work, or the second solution to run faster?
value
can also contain numbers, or a short phrase like "test1 test2".
Upvotes: 4
Views: 11007
Reputation: 2171
You could split the string by space and then match the elements from value
with ==
.
You said that some strings in values
can have space before or after them. You can resolve that with this line:
values = [i.strip() for i in values]
That will remove all of the whitespace characters before and after the string (in your case for each element).
Furthermore, you mentioned that if you split the string by space, some words have punctuations left over from splitting, e.g. 'Hi, how are you?'
would result in ['Hi,', 'how', 'are', 'you?']
. You can resolve this problem by using the string startswith()
method to filter all the words starting with elements from values
like this:
str = ['Hi,', 'how', 'are', 'you?']`
values = ['how', 'you', 'time', 'space']
new_str = []
for word in str:
for j in values:
if word.startswith(j):
new_str.append(word)
# result -> ['how', 'you?']
Then you can remove punctuations from the resulting list with some regex, but now you will have a lot smaller list to iterate through. After you remove all of the punctuation characters, then you can match the whole strings as I suggested.
Upvotes: 0
Reputation: 149796
It's hard to suggest an optimal solution without seeing the actual data, but you can try these things:
'^'
or '*'
).list.extend()
repeatedly.# 'str' is a built-in function, so use another name instead
string = 'A Test test string from which I want to match multiple substrings'
values = ['test', 'test2', 'Multiple', 'ring', 'match']
pattern = r'\b({})\b'.format('|'.join(map(re.escape, values)))
# unique matches, uppercased
matches = set(map(str.upper, re.findall(pattern, string, regex.IGNORECASE)))
# arrange the results as they appear in `values`
matched_values = [v for value in values if (v := value.upper()) in matches]
print(matched_values) # ['TEST', 'MULTIPLE', 'MATCH']
Upvotes: 6
Reputation: 1
It takes about 3 seconds for 1 million executions of the 'statement' variable on my laptop using the following code:
from timeit import timeit
import re
# I inserted punctuation to cover more situations
string = """
This, is a ; test string from which I want to match multiple substrings
"""
substrings = ['test', 'match', 'multiple', 'ring']
statement = """
string_normalized = (re.sub(r'[^\w\s]', '', string)
.strip()
.lower()
.split(' '))
result = list(filter(lambda x: x in string_normalized, substrings))
"""
if __name__ == '__main__':
time_it_takes = timeit(stmt=statement, globals=globals(), number=1_000_000)
print(time_it_takes)
Upvotes: -1
Reputation: 76194
Two possible optimizations come to mind:
re.compile
so it doesn't recompile every time you call match
.
import re
str = "This is a test string from which I want to match test1 test2 multiple substrings"
values = ["test", "match", "multiple", "ring", "test1 test2"]
pattern = re.compile("|".join(r"\b" + re.escape(x) + r"\b" for x in values))
temp = []
temp.extend([x.upper() for x in pattern.findall(str, re.IGNORECASE)])
print(temp)
Result:
['TEST', 'MATCH', 'TEST1 TEST2', 'MULTIPLE']
Potential drawbacks to this approach:
values
. This approach puts results in the order they appear in str
.temp
if it appeared multiple times in str
. As opposed to your original approach, where the value appears at most once in temp
.search
terminates as soon as it finds a match. findall
always searches the entire string. If you expect most of your strings to match every word in value
, and expect most matches to appear early on in the string, then findall
may be slower than search
. On the other hand, if you expect search to often turn up None
, then findall
will likely be somewhat faster. Upvotes: 3