Peter
Peter

Reputation: 3495

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?

I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)

Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.

#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]

#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )

print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])

I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.

Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.

Upvotes: 1

Views: 875

Answers (2)

Slam
Slam

Reputation: 8572

The first thing that comes to mind is that you can optimize lookup by checking current combination against combinations that are already "invalid". I.e. if ab is invalid, than ab.? will be invalid too and there's no point to check such.

And one more thing: try using

for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)

instead of

invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)

I'm not sure, but less memory allocations can be a small boost for real data run.

Upvotes: 1

Ryan O'Donnell
Ryan O'Donnell

Reputation: 617

Seeing if a set contains an item is O(1). You would still have to iterate through your list of combinations (with some exceptions. If your word doesn't have "a" it's not going to have any other combinations that contain "a". You can use some tree-like data structure for this) to compare with your original set of words.

You shouldn't convert your wordlist to a string, but rather a set. You should get O(N) where N is the length of your combinations.

Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.

Upvotes: 0

Related Questions