rhoadster91
rhoadster91

Reputation: 344

Clusters of words with Hamming distance of 1

You are given a set of words, e.g.

{ ded, aaa, dec, aab, cab, def }

You need to add each word into a group. A word can be added to a group if:

  1. It is the only word in the group
  2. There is at least one other word already in the group that is at the most 1 edit distance away from the given word.

Your function must return the minimum possible number of such groups that can be formed by using all the given words.

Example, for the given input string, the groups would look like this:

{ aaa, aab, cab }, { ded, def, dec }

Explanation: distance(aaa, aab) is 1 so they belong in the same group. Also, distance(aab, cab) is also 1, so they also belong in the same group. But no words in the second group are at an edit distance of 1 from any other word in the first group, but are at an edit distance of 1 from at least one other word in their own group.

If we were given two more words in addition to the ones in the example, let's say cad, ced, then the answer would change to 1, because now distance(cab, cad) is 1, hence cad is in group 1 and distance(cad, ced) is 1, so ced is in group 1. Also, distance(ded, ced) is 1, so the second group will be "connected" with the first group, hence we will be left with only 1 group.

We're only interested in the number of groups, not the groups themselves.

Constraints: all words will have the same length, but that length is not fixed and could be large.

I could only come up with O(mn^2) where m is the length of any word and n is number of words. Did this using graph approach (Each word as a node and word with edit distance 1 as a neighbouring node).

Expected solution is O(mn).

Upvotes: 1

Views: 1167

Answers (1)

rhoadster91
rhoadster91

Reputation: 344

Found a solution which is an extension of the accepted solution here: Efficiently build a graph of words with given Hamming distance

Basically, the idea is to store the strings in a Set where lookup and delete are O(1) on average. Putting them in a set means we'd be overwriting strings with edit distance of 0 i.e. equal strings. But we don't care for them anyway, as they will always be in the same group.

  1. Create an empty list of "start nodes" N.
  2. Add next item from the set S in the list
  3. Remove this string from the set S and call 4 for this string.
  4. Generate all strings with Hamming distance 1 from string passed in parameter. For each such generated string, if it exists in the set, remove it from the set and call 4 for this string.
  5. While set is not empty, repeat 2
  6. Return size of "start nodes" list

Explanation of why this would work:

We traverse each node only once and remove it from the set. After we remove the string from the set, we also recursively remove any item in the set that was "adjacent" to it. But only the first node in the recursion is added to the start nodes list.

In our example, ded would get added to the node list and dec, def would get removed. Then aaa would get added to the node list and aab would be removed. While removing aab, recursively, cab would also be removed. The returned answer would be 2.

Time complexity:

O(mnC) where C is the size of the charset, m is the length of the string and n is the number of strings.

C substitutions made for each character in string m times. This is done once for each item in the string set.

Upvotes: 2

Related Questions