Reputation: 1097
I have a list that contains 100,000+ words/phrases sorted by length
let list = [“string with spaces”, “another string”, “test”, ...]
I need to find the longest element in the list above that is inside a given sentence. This is my initial solution
for item in list {
if sentence == item
|| sentence.startsWith(item + “ “)
|| sentence.contains(“ “ + item + “ “)
|| sentence.endsWith(“ “ + item) {
...
break
}
}
This issue I am running into is that this is too slow for my application. Is there a different approach I could take to make this faster?
Upvotes: 1
Views: 371
Reputation: 1097
The solution I decided to use was a Trie https://en.wikipedia.org/wiki/Trie. Each node in the trie is a word, and all I do is tokenize the input sentence (by word) and traverse the trie.
This improved performance from ~140 seconds to ~5 seconds
Upvotes: 0
Reputation: 19601
You could build an Aho-Corasick searcher from the list and then run this on the sentence. According to https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm "The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa). "
Upvotes: 1
Reputation: 46628
I would break the given sentence up into a list of words and then compute all possible contiguous sublists (i.e. phrases). Given a sentence of n
words, there are n * (n + 1) / 2
possible phrases that can be found inside it.
If you now substitute your list of search phrases ([“string with spaces”, “another string”, “test”, ...]
) for an (amortized) constant time lookup data structure like a hashset, you can walk over the list of phrases you computed in the previous step and check whether each one is in the set in ~ constant time.
The overall time complexity of this algorithm scales quadratically in the size of the sentence, and is roughly independent of the size of the set of search terms.
Upvotes: 0