Reputation: 344
You are given a set of words, e.g.
{ ded, aaa, dec, aab, cab, def }
You need to add each word into a group. A word can be added to a group if:
- It is the only word in the group
- There is at least one other word already in the group that is at the most 1 edit distance away from the given word.
Your function must return the minimum possible number of such groups that can be formed by using all the given words.
Example, for the given input string, the groups would look like this:
{ aaa, aab, cab }, { ded, def, dec }
Explanation:
distance(aaa, aab)
is 1 so they belong in the same group. Also,distance(aab, cab)
is also 1, so they also belong in the same group. But no words in the second group are at an edit distance of 1 from any other word in the first group, but are at an edit distance of 1 from at least one other word in their own group.If we were given two more words in addition to the ones in the example, let's say
cad, ced
, then the answer would change to 1, because nowdistance(cab, cad)
is 1, hence cad is in group 1 anddistance(cad, ced)
is 1, so ced is in group 1. Also, distance(ded, ced) is 1, so the second group will be "connected" with the first group, hence we will be left with only 1 group.We're only interested in the number of groups, not the groups themselves.
Constraints: all words will have the same length, but that length is not fixed and could be large.
I could only come up with O(mn^2) where m is the length of any word and n is number of words. Did this using graph approach (Each word as a node and word with edit distance 1 as a neighbouring node).
Expected solution is O(mn).
Upvotes: 1
Views: 1167
Reputation: 344
Found a solution which is an extension of the accepted solution here: Efficiently build a graph of words with given Hamming distance
Basically, the idea is to store the strings in a Set where lookup and delete are O(1) on average. Putting them in a set means we'd be overwriting strings with edit distance of 0 i.e. equal strings. But we don't care for them anyway, as they will always be in the same group.
Explanation of why this would work:
We traverse each node only once and remove it from the set. After we remove the string from the set, we also recursively remove any item in the set that was "adjacent" to it. But only the first node in the recursion is added to the start nodes list.
In our example, ded would get added to the node list and dec, def would get removed. Then aaa would get added to the node list and aab would be removed. While removing aab, recursively, cab would also be removed. The returned answer would be 2.
Time complexity:
O(mnC) where C is the size of the charset, m is the length of the string and n is the number of strings.
C substitutions made for each character in string m times. This is done once for each item in the string set.
Upvotes: 2