Reputation: 332

What is a better way to match multiple substrings in a string than to search for each substring in a loop?

I am working with Python and I want to match a given string with multiple substrings. I have tried to solve this problem in two different ways. My first solution was to match the substring with the string like:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([x.upper() for x in value if x.lower() in str.lower()])
print(temp)

which results in temp = ["TEST", "MATCH", "MULTIPLE", "RING"].

However, this is not the result I would like. The substrings should have an exact match, so "ring" should not match with "string".

This is why I tried to solve this problem with regular expressions, like this:

str = "This is a test string from which I want to match multiple substrings"
value = ["test", "match", "multiple", "ring"]
temp = []
temp.extend([
    x.upper() for x in value
    if regex.search(
        r"\b" + regex.escape(x) + r"\b", str, regex.IGNORECASE
    ) is not None
])
print(temp)

which results in ["TEST", "MATCH", "MULTIPLE"], the correct solution.

Be that as it may, this solution takes too long to compute. I have to do this check for roughly 1 million strings and the solution using regex will take days to finish compared to the 1.5 hours it takes using the first solution.

Is there a way to either make the first solution work, or the second solution to run faster?

value can also contain numbers, or a short phrase like "test1 test2".

Upvotes: 4

Answers (4)

Novak

Reputation: 2171

You could split the string by space and then match the elements from value with ==.

You said that some strings in values can have space before or after them. You can resolve that with this line:

values = [i.strip() for i in values]

That will remove all of the whitespace characters before and after the string (in your case for each element).

Furthermore, you mentioned that if you split the string by space, some words have punctuations left over from splitting, e.g. 'Hi, how are you?' would result in ['Hi,', 'how', 'are', 'you?']. You can resolve this problem by using the string startswith() method to filter all the words starting with elements from values like this:

str = ['Hi,', 'how', 'are', 'you?']`
values = ['how', 'you', 'time', 'space']

new_str = []
for word in str:
  for j in values:
    if word.startswith(j):
      new_str.append(word)

# result -> ['how', 'you?']

Then you can remove punctuations from the resulting list with some regex, but now you will have a lot smaller list to iterate through. After you remove all of the punctuation characters, then you can match the whole strings as I suggested.

Upvotes: 0

Eugene Yarmash

Reputation: 150178

It's hard to suggest an optimal solution without seeing the actual data, but you can try these things:

Generate a single pattern matching all values. This way you would only need to search the string once (instead of once per value).
Skip escaping values unless they contain special characters (like '^' or '*').
Create the resulting list using a list comprehension instead of calling list.extend() repeatedly.

# 'str' is a built-in function, so use another name instead
string = 'A Test test string from which I want to match multiple substrings'
values = ['test', 'test2', 'Multiple', 'ring', 'match']
pattern = r'\b({})\b'.format('|'.join(map(re.escape, values)))

# unique matches, uppercased
matches = set(map(str.upper, re.findall(pattern, string, regex.IGNORECASE)))

# arrange the results as they appear in `values`
matched_values = [v for value in values if (v := value.upper()) in matches]
print(matched_values)  # ['TEST', 'MULTIPLE', 'MATCH']

Upvotes: 6

tudor38

Reputation: 1

It takes about 3 seconds for 1 million executions of the 'statement' variable on my laptop using the following code:

from timeit import timeit
import re

# I inserted punctuation to cover more situations
string = """
This, is a ; test string from which I want to match multiple substrings
"""
substrings = ['test', 'match', 'multiple', 'ring']

statement = """
string_normalized = (re.sub(r'[^\w\s]', '', string)
                       .strip()
                       .lower()
                       .split(' '))
result = list(filter(lambda x: x in string_normalized, substrings))
"""

if __name__ == '__main__':
    time_it_takes = timeit(stmt=statement, globals=globals(), number=1_000_000)
    print(time_it_takes)

Upvotes: -1

Kevin

Reputation: 76254

Two possible optimizations come to mind:

precompile patterns with re.compile so it doesn't recompile every time you call match.
rather than matching against four independent regexes, create one regex that matches all of your values.

import re

str = "This is a test string from which I want to match test1 test2 multiple substrings"
values = ["test", "match", "multiple", "ring", "test1 test2"]

pattern = re.compile("|".join(r"\b" + re.escape(x) + r"\b" for x in values))
temp = []

temp.extend([x.upper() for x in pattern.findall(str, re.IGNORECASE)])
print(temp)

Result:

['TEST', 'MATCH', 'TEST1 TEST2', 'MULTIPLE']

Potential drawbacks to this approach:

The output will possibly be in a different order. Your original approach puts results in the order they appear in values. This approach puts results in the order they appear in str.
the same value will appear multiple times in temp if it appeared multiple times in str. As opposed to your original approach, where the value appears at most once in temp.
search terminates as soon as it finds a match. findall always searches the entire string. If you expect most of your strings to match every word in value, and expect most matches to appear early on in the string, then findall may be slower than search. On the other hand, if you expect search to often turn up None, then findall will likely be somewhat faster.

Upvotes: 3

What is a better way to match multiple substrings in a string than to search for each substring in a loop?

Answers (4)

Related Questions