copperhead
copperhead

Reputation: 677

efficient way of making a graph of a dictionary

What is the most efficient way of making a graph of the words in a dictionary with hamming distance=1??

Upvotes: 5

Views: 1178

Answers (2)

Michael
Michael

Reputation: 42100

I would suggest an adjacency map map<string, list<string> to represent the graph. It maps a graph node (a word in our case) to the list of the adjacent nodes (i.e. all words within distance == 1).

How to populate this map ? First, break down the dictionary into sets of words of the same length set<string> wordset[MAX_WORD_LENGTH] and populate the map as follows.

map<string, list<string>> adjacency_map
for each wordset in wordset array  
  for each w1 in wordset
     for each w2 in wordset
        if one_char_off(w1, w2) { // if the distance between w1 and w2 == 1
           update(adjacency_map, w1, w2)
           update(adjacency_map, w2, w1)
        }

Function one_char_off(string w1, string w2) checks if the distance between w1 and w2 is 1.

int diff = 0
for i in [0, w1.length) // both w1 and w2 have the same length
   if w1[i] != w2[i]
      if ++diff > 1
         return false
return diff == 1

Function update(map<string, list<string>> adjacency_map, string w1, string w2) adds the pair of adjacent words w1 and w2 to the adjacent map.

Upvotes: 1

Nick Johnson
Nick Johnson

Reputation: 101149

Hamming distance is only defined for words of equal length, so you'll actually have one disjoint graph for each word length in your dictionary. If you meant levenshtein distance, which permits insertions and deletions, then you will indeed have a single graph.

One option is to construct a BK-tree from your dictionary. While not strictly speaking a graph, it allows you to ask the same questions (getting a list of elements with a given distance), and takes O(n log n) time to construct.

Another option is brute-force: For every word, test its distance to all candidate words. You can narrow down candidate words to those of the same length (or length one less or greater, for levenshtein). This is O(n^2) worst-case, but this may be acceptable if you're not building the graph more than once.

Theoretically, there's probably an O(n log n) method of constructing the graph - in the trivial case, constructing a BK-tree, then generating the graph from that is O(mn log n), where m is the average number of edges per node - but I'm not aware of an elegant one.

Upvotes: 6

Related Questions