Reputation: 151
I'm looking to improve an algorithm I currently have which, whilst it works, has a complexity of O(n^2). I'm looking to reduce that complexity if possible, or to improve/change the algorithm itself in order to improve the runtime.
I have a list of strings that each contain multiple words and the end goal is to find "matches" between these strings, sorted based upon a percentage "likeness".
Let's say I have the following strings:
"The End Of The World"
"The Start Of The Journey"
"The End Of Time"
"Time We Left This World Today"
My algorithm compares every string against every other string in a nested loop and works out a percentage "likeness" for each pair.
Ultimately, I'm left with pairs of strings (every possible pairing of all strings in the starting list) and a percentage value of the match between them. I can then discard all those matches below some threshold and work only with those that are above it. The threshold is user-defined, and the whole algorithm serves as a way to "filter" a very large set of data, allowing human eyeballs to work only with those pieces of data that seem closely matched in the first place.
As you can imagine, the nested loop (i.e. the O(n^2) part of the algorithm) is very slow and gets considerably slower as the size of the input grows.
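For reference, a minimal sketch of that kind of pairwise pass (the likeness metric below, distinct shared words divided by the larger word count, is only a placeholder for the real percentage calculation):

#include <algorithm>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct Match { size_t a, b; double likeness; };

// Placeholder likeness: fraction of distinct words the two strings share.
double Likeness(const std::string& lhs, const std::string& rhs) {
    auto toWords = [](const std::string& s) {
        std::set<std::string> out;
        std::istringstream in(s);
        for (std::string w; in >> w;) out.insert(w);
        return out;
    };
    std::set<std::string> a = toWords(lhs), b = toWords(rhs);
    std::vector<std::string> shared;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(shared));
    return (a.empty() || b.empty()) ? 0.0
                                    : double(shared.size()) / std::max(a.size(), b.size());
}

// The O(n^2) part: compare every string against every other string,
// keep only the pairs at or above the user-defined threshold.
std::vector<Match> FindMatches(const std::vector<std::string>& strings, double threshold) {
    std::vector<Match> result;
    for (size_t i = 0; i < strings.size(); ++i)
        for (size_t j = i + 1; j < strings.size(); ++j) {
            double likeness = Likeness(strings[i], strings[j]);
            if (likeness >= threshold) result.push_back({i, j, likeness});
        }
    return result;
}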
Is there any way to improve the Big O of this algorithm or are there any changes to the algorithm producing the same output that will improve the runtime complexity?
Upvotes: 2
Views: 1659
Reputation: 16099
There is the extra complication that, if you're pulling the strings around with you in all computations, the last operation is not O(M^2) but O(M^2 * sizeof(sentence) * AvgLength(word)).
Let's see (concept code):
std::vector<std::set<int>> sentenceSets;
sentenceSets.reserve(sentences.size());
for (auto& sentence : sentences) { // O(m)
    std::vector<const char *> words = SplitWord(sentence); // O(n), needs to go through all letters.
    sentenceSets.emplace_back();
    for (auto& word : words) {
        int wordNo = LookUp(word); // table of all words, with entries of 0 for unwanted words. O(log AllWords)
        if (wordNo)
            sentenceSets.back().insert(wordNo); // also removes duplicates. O(log(#diff words in sentence))
    }
}
Total: O(m * log(AllWords) * avgWordLen), or O(m * collisionFactor * avgWordLen) if you believe your hash table of all possible words works perfectly.
LookUp saves a factor of O(letters in word) for all later compares.
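A minimal sketch of what such a LookUp could look like (the table itself, how it is filled, and which words count as unwanted are all assumptions here):

#include <string>
#include <unordered_map>

// Hypothetical word table: every known word maps to a small integer id;
// unwanted/stop words map to 0 (as do words not present in the table at all).
std::unordered_map<std::string, int> wordTable;

int LookUp(const std::string& word) {
    auto it = wordTable.find(word); // expected O(1); use std::map for the O(log AllWords) variant
    return it == wordTable.end() ? 0 : it->second;
}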
for (const auto& theSet : sentenceSets) {     // O(sentenceSets.size())
    for (const auto& cmpSet : sentenceSets) { // O(sentenceSets.size())
        std::vector<int> intersect;
        std::set_intersection(theSet.begin(), theSet.end(),
                              cmpSet.begin(), cmpSet.end(),
                              std::back_inserter(intersect)); // O(set.size())
        StoreRes(theSet, cmpSet, intersect);
    }
}
Total here is O(sentenceSets.size()^2 * set.size()). It could be optimized to only run sentenceSets.size() * sentenceSets.size() / 2 comparisons, as the table is symmetric.
Using the LookUp saves a factor of O(word size) here.
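A sketch of the symmetric variant mentioned above, indexing the sets so each unordered pair is visited only once:

for (size_t i = 0; i < sentenceSets.size(); ++i) {
    for (size_t j = i + 1; j < sentenceSets.size(); ++j) { // skips self-compares and mirrored pairs
        std::vector<int> intersect;
        std::set_intersection(sentenceSets[i].begin(), sentenceSets[i].end(),
                              sentenceSets[j].begin(), sentenceSets[j].end(),
                              std::back_inserter(intersect));
        StoreRes(sentenceSets[i], sentenceSets[j], intersect);
    }
}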
The std::set might be replaced with some flat_set for faster real-world operations.
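For instance, a sorted std::vector<int> already behaves like a flat set and works directly with std::set_intersection (boost::container::flat_set would be another option); a rough sketch, reusing the hypothetical SplitWord/LookUp helpers from above:

std::vector<std::vector<int>> flatSets;
flatSets.reserve(sentences.size());
for (auto& sentence : sentences) {
    std::vector<int> ids;
    for (auto& word : SplitWord(sentence))
        if (int wordNo = LookUp(word))
            ids.push_back(wordNo);
    std::sort(ids.begin(), ids.end());                         // set_intersection needs sorted input
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end()); // drop duplicate words
    flatSets.push_back(std::move(ids));
}
// The comparison loop stays the same: std::set_intersection works on any pair of sorted ranges.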
Upvotes: 1