Reputation: 501

Efficient algorithm for phrase anagrams

What is an efficient way to produce phrase anagrams given a string?

The problem I am trying to solve

Assume you have a word list with n words. Given an input string, say, "peanutbutter", produce all phrase anagrams. Some contenders are: pea nut butter, A But Ten Erupt, etc.

My solution

I have a trie that contains all words in the given word list. Given an input string, I calculate all permutations of it. For each permutation, I have a recursive solution (something like this) to determine if that specific permuted string can be broken in to words. For example, if one of the permutations of peanutbutter was "abuttenerupt", I used this method to break it into "a but ten erupt". I use the trie to determine if a string is a valid word.

What sucks

My problem is that because I calculate all permutations, my solution runs very slow for phrases that are longer than 10 characters, which is a big let down. I want to know if there is a way to do this in a different way. Websites like https://wordsmith.org/anagram/ can do the job in less than a second and I am curious to know how they do it.

Upvotes: 4

Answers (2)

Can Nguyen

Reputation: 1470

Your problem can be decomposed to 2 sub-problems:

Find combination of words that use up all characters of the input string
Find all permutations of the words found in the first sub-problem

Subproblem #2 is a basic algorithm and you can find existing standard implementation in most programming language. Let's focus on subproblem #1

First convert the input string to a "character pool". We can implement the character pool as an array oc, where oc[c] = number of occurrence of character c.

Then we use backtracking algorithm to find words that fit in the charpool as in this pseudo-code:

result = empty;

function findAnagram(pool)
  if (pool empty) then print result;
  for (word in dictionary) {
    if (word fit in charpool) {
      result = result + word;
      update pool to exclude characters in word;
      findAnagram(pool);

      // as with any backtracking algorithm, we have to restore global states
      restore pool;
      restore result;
    }
  }
}

Note: If we pass the charpool by value then we don't have to restore it. But as it is quite big, I prefer passing it by reference.

Now we remove redundant results and apply some optimizations:

Assuming A comes before B in the dictionary. If we choose the first word is B, then we don't have to consider word A in following steps, because those results (if we take A) would already be in the case where A is chosen as the first word
If the character set is small enough (< 64 characters is best), we can use a bitmask to quickly filter words that cannot fit in the pool. A bitmask mask which character is in a word, no matter how many time it occurs.

Update the pseudo-code to reflect those optimizations:

function findAnagram(charpool, minDictionaryIndex)
  pool_bitmask <- bitmask(charpool);
  if (pool empty) then print result;
  for (word in dictionary AND word's index >= minDictionaryIndex) {
    // bitmask of every words in the dictionary should be pre-calculated
    word_bitmask <- bitmask(word)
    if (word_bitmask contains bit(s) that is not in pool_bitmask)
      then skip this for iteration
    if (word fit in charpool) {
      result = result + word;
      update charpool to exclude characters in word;
      findAnagram(charpool, word's index);

      // as with any backtracking algorithm, we have to restore global states
      restore pool;
      restore result;
    }
  }
}

My C++ implementation of subproblem #1 where the character set contains only lowercase 'a'..'z': http://ideone.com/vf7Rpl .

Upvotes: 7

samgak

Reputation: 24427

Instead of a two stage solution where you generate permutations and then try and break them into words, you could speed it up by checking for valid words as you recursively generate the permutations. If at any point your current partially-complete permutation does not correspond to any valid words, stop there and do not recurse any further. This means you don't waste time generating useless permutations. For example, if you generate "tt", there is no need to permute "peanubuter" and append all the permutations to "tt" because there are no English words beginning with tt.

Suppose you are doing basic recursive permutation generation, keep track of the current partial word you have generated. If at any point it is a valid word, you can output a space and start a new word, and recursively permute the remaining character. You can also try adding each of the remaining characters to the current partial word, and only recurse if doing so results in a valid partial word (i.e. a word exists starting with those characters).

Something like this (pseudo-code):

 void generateAnagrams(String partialAnagram, String currentWord, String remainingChars)
 {
      // at each point, you can either output a space, or each of the remaining chars:

      // if the current word is a complete valid word, you can output a space
      if(isValidWord(currentWord))
      {
           // if there are no more remaining chars, output the anagram:
           if(remainingChars.length == 0)
           {
               outputAnagram(partialAnagram);
           }
           else
           {
               // output a space and start a new word
               generateAnagrams(partialAnagram + " ", "", remainingChars);
           }
      }

      // for each of the chars in remainingChars, check if it can be
      // added to currentWord, to produce a valid partial word (i.e.
      // there is at least 1 word starting with these characters)
      for(i = 0 to remainingChars.length - 1)
      {
          char c = remainingChars[i];
          if(isValidPartialWord(currentWord + c)
          {
              generateAnagrams(partialAnagram + c, currentWord + c,
                  remainingChars.remove(i));
          }
      }
 }

You could call it like this

 generateAnagrams("", "", "peanutbutter");

You could optimize this algorithm further by passing the node in the trie corresponding to the current partially completed word, as well as passing currentWord as a string. This would make your isValidPartialWord check even faster.

You can enforce uniqueness by changing your isValidWord check to only return true if the word is in ascending (greater or equal) alphabetic order compared to the previous word output. You might also need another check for dupes at the end, to catch cases where two of the same word can be output.

Upvotes: 2

Efficient algorithm for phrase anagrams

Answers (2)

Related Questions