StackOverflower
StackOverflower

Reputation: 1097

More efficient way to find phrases in a string?

I have a list that contains 100,000+ words/phrases sorted by length

let list = [“string with spaces”, “another string”, “test”, ...]

I need to find the longest element in the list above that is inside a given sentence. This is my initial solution

for item in list {
    if sentence == item
        || sentence.startsWith(item + “ “) 
        || sentence.contains(“ “ + item + “ “) 
        || sentence.endsWith(“ “ + item) {
        ...
        break
    }
}

This issue I am running into is that this is too slow for my application. Is there a different approach I could take to make this faster?

Upvotes: 1

Views: 371

Answers (3)

StackOverflower
StackOverflower

Reputation: 1097

The solution I decided to use was a Trie https://en.wikipedia.org/wiki/Trie. Each node in the trie is a word, and all I do is tokenize the input sentence (by word) and traverse the trie.

This improved performance from ~140 seconds to ~5 seconds

Upvotes: 0

mcdowella
mcdowella

Reputation: 19601

You could build an Aho-Corasick searcher from the list and then run this on the sentence. According to https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm "The complexity of the algorithm is linear in the length of the strings plus the length of the searched text plus the number of output matches. Note that because all matches are found, there can be a quadratic number of matches if every substring matches (e.g. dictionary = a, aa, aaa, aaaa and input string is aaaa). "

Upvotes: 1

Asad Saeeduddin
Asad Saeeduddin

Reputation: 46628

I would break the given sentence up into a list of words and then compute all possible contiguous sublists (i.e. phrases). Given a sentence of n words, there are n * (n + 1) / 2 possible phrases that can be found inside it.

If you now substitute your list of search phrases ([“string with spaces”, “another string”, “test”, ...]) for an (amortized) constant time lookup data structure like a hashset, you can walk over the list of phrases you computed in the previous step and check whether each one is in the set in ~ constant time.

The overall time complexity of this algorithm scales quadratically in the size of the sentence, and is roughly independent of the size of the set of search terms.

Upvotes: 0

Related Questions