Reputation: 151
I'm looking to improve an algorithm I currently have which, whilst it works, has a complexity of O(n^2). I'm looking to reduce that complexity if possible, or to improve/change the algorithm itself in order to improve the runtime.
I have a list of strings that each contain multiple words and the end goal is to find "matches" between these strings, sorted based upon a percentage "likeness".
Let's say I have the following strings:
"The End Of The World"
"The Start Of The Journey"
"The End Of Time"
"Time We Left This World Today"
My algorithm compares every string against every other string in a nested loop and works out a percentage "likeness" for each pair.
Ultimately, I'm left with pairs of strings (every possible pairing of all strings in the starting list) and a percentage value of the match between them. I can then discard all those matches below some threshold and work only with those that are above it. The threshold is user-defined, and the whole algorithm serves as a way to "filter" a very large set of data, allowing human eyeballs to work only with those pieces of data that seem closely matched in the first place.
As you can imagine, the nested loop (i.e. the O(n^2) part of the algorithm) is very slow and gets considerably slower as the size of the input grows.
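For reference, a minimal sketch of that kind of pairwise pass (the likeness metric below, distinct shared words divided by the larger word count, is only a placeholder for the real percentage calculation):

#include <algorithm>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct Match { size_t a, b; double likeness; };

// Placeholder likeness: fraction of distinct words the two strings share.
double Likeness(const std::string& lhs, const std::string& rhs) {
    auto toWords = [](const std::string& s) {
        std::set<std::string> out;
        std::istringstream in(s);
        for (std::string w; in >> w;) out.insert(w);
        return out;
    };
    std::set<std::string> a = toWords(lhs), b = toWords(rhs);
    std::vector<std::string> shared;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(shared));
    return (a.empty() || b.empty()) ? 0.0
                                    : double(shared.size()) / std::max(a.size(), b.size());
}

// The O(n^2) part: compare every string against every other string,
// keep only the pairs at or above the user-defined threshold.
std::vector<Match> FindMatches(const std::vector<std::string>& strings, double threshold) {
    std::vector<Match> result;
    for (size_t i = 0; i < strings.size(); ++i)
        for (size_t j = i + 1; j < strings.size(); ++j) {
            double likeness = Likeness(strings[i], strings[j]);
            if (likeness >= threshold) result.push_back({i, j, likeness});
        }
    return result;
}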
Is there any way to improve the Big O of this algorithm or are there any changes to the algorithm producing the same output that will improve the runtime complexity?
Upvotes: 2
Views: 1659
Reputation: 16099
There is the extra complication that, if you're pulling the strings around with you in all computations, the last operation is not O(M^2) but O(M^2 * sizeof(sentence) * AvgLength(word)).
Let's see (concept code):
std::vector<std::set<int>> sentenceSets;
sentenceSets.reserve(sentences.size());
for (auto& sentence : sentences) { // O(m)
    std::vector<const char *> words = SplitWord(sentence); // O(n), needs to go through all letters.
    sentenceSets.emplace_back();
    for (auto& word : words) {
        int wordNo = LookUp(word); // table of all words, with entries of 0 for unwanted words. O(log AllWords)
        if (wordNo)
            sentenceSets.back().insert(wordNo); // also removes duplicates. O(log(#diff words in sentence))
    }
}
Total: O(m * log(AllWords) * avgWordLen), or O(m * collisionFactor * avgWordLen) if you believe your hash table of all possible words works perfectly.
LookUp saves a factor of O(letters in word) for all later compares.
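A minimal sketch of what such a LookUp could look like (the table itself, how it is filled, and which words count as unwanted are all assumptions here):

#include <string>
#include <unordered_map>

// Hypothetical word table: every known word maps to a small integer id;
// unwanted/stop words map to 0 (as do words not present in the table at all).
std::unordered_map<std::string, int> wordTable;

int LookUp(const std::string& word) {
    auto it = wordTable.find(word); // expected O(1); use std::map for the O(log AllWords) variant
    return it == wordTable.end() ? 0 : it->second;
}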
for (const auto& theSet : sentenceSets) {     // O(sentenceSets.size())
    for (const auto& cmpSet : sentenceSets) { // O(sentenceSets.size())
        std::vector<int> intersect;
        std::set_intersection(theSet.begin(), theSet.end(),
                              cmpSet.begin(), cmpSet.end(),
                              std::back_inserter(intersect)); // O(set.size())
        StoreRes(theSet, cmpSet, intersect);
    }
}
Total here is O(sentenceSets.size()^2 * set.size()). It could be optimized to only run sentenceSets.size() * sentenceSets.size() / 2 comparisons, as the table is symmetric.
Using the LookUp saves a factor of O(word size) here.
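A sketch of the symmetric variant mentioned above, indexing the sets so each unordered pair is visited only once:

for (size_t i = 0; i < sentenceSets.size(); ++i) {
    for (size_t j = i + 1; j < sentenceSets.size(); ++j) { // skips self-compares and mirrored pairs
        std::vector<int> intersect;
        std::set_intersection(sentenceSets[i].begin(), sentenceSets[i].end(),
                              sentenceSets[j].begin(), sentenceSets[j].end(),
                              std::back_inserter(intersect));
        StoreRes(sentenceSets[i], sentenceSets[j], intersect);
    }
}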
The std::set might be replaced with some flat_set for faster real-world operations.
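For instance, a sorted std::vector<int> already behaves like a flat set and works directly with std::set_intersection (boost::container::flat_set would be another option); a rough sketch, reusing the hypothetical SplitWord/LookUp helpers from above:

std::vector<std::vector<int>> flatSets;
flatSets.reserve(sentences.size());
for (auto& sentence : sentences) {
    std::vector<int> ids;
    for (auto& word : SplitWord(sentence))
        if (int wordNo = LookUp(word))
            ids.push_back(wordNo);
    std::sort(ids.begin(), ids.end());                         // set_intersection needs sorted input
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end()); // drop duplicate words
    flatSets.push_back(std::move(ids));
}
// The comparison loop stays the same: std::set_intersection works on any pair of sorted ranges.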
Upvotes: 1