Reputation: 59709

Algorithm to generate anagrams

What would be the best strategy to generate anagrams.

An anagram is a type of word play, the result of rearranging the letters
of a word or phrase to produce a new  word or phrase, using all the original
letters exactly once; 
ex.
Eleven plus two is anagram of Twelve plus one

A decimal point is anagram of I'm a dot in place

Astronomers is anagram of Moon starers

At first it looks straightforwardly simple, just to jumble the letters and generate all possible combinations. But what would be the efficient approach to generate only the words in dictionary.

I came across this page, Solving anagrams in Ruby.

But what are your ideas?

Upvotes: 57

Answers (15)

Lance Pollard

Reputation: 79458

In SQL, if you have a database of strings (language_string), and for each string, corresponding records in language_string_symbol (count symbols for each string), then you can simply fetch all the language strings joined with the string symbol records with the corresponding counts. The tables might be like:

CREATE TABLE language_string (
    id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    text VARCHAR(255) NOT NULL UNIQUE
);

CREATE TABLE language_string_symbol (
    string__id BIGINT NOT NULL,
    symbol CHAR(1) NOT NULL,
    count INT NOT NULL,
    PRIMARY KEY (string__id, symbol)
);

Count the characters in the input, and fetch all things by joining language_string_symbol with language_string, like this:

SELECT s.id
FROM language_string s
JOIN language_string_symbol sym ON s.id = sym.string__id
WHERE sym.symbol IN ('h', 'e', 'l', 'o', 'w', 'r', 'd') -- Only relevant symbols
GROUP BY s.id
HAVING COUNT(CASE WHEN sym.symbol = 'h' AND sym.count <= 1 THEN 1 END) >= 1
  AND COUNT(CASE WHEN sym.symbol = 'e' AND sym.count <= 1 THEN 1 END) >= 1
  AND COUNT(CASE WHEN sym.symbol = 'l' AND sym.count <= 3 THEN 1 END) >= 1
  AND COUNT(CASE WHEN sym.symbol = 'o' AND sym.count <= 2 THEN 1 END) >= 1
  AND COUNT(CASE WHEN sym.symbol = 'w' AND sym.count <= 1 THEN 1 END) >= 1
  AND COUNT(CASE WHEN sym.symbol = 'r' AND sym.count <= 1 THEN 1 END) >= 1
  AND COUNT(CASE WHEN sym.symbol = 'd' AND sym.count <= 1 THEN 1 END) >= 1;

This will match against the database of strings, taking any user input, but not generating dynamic new combinations of database string records. It will just match against what is there in the database.

So if you have twelve plus one in the database, then searching eleven plus two will find it. But if twelve plus one is not in the database, and you only have twelve, plus, and one as separate records, it will not find it.

It appears you can use "Recursive CTE" for multi-word combinations in SQL, something like this:

WITH RECURSIVE anagrams AS (
    -- Base case: Start with single words
    SELECT 
        ls.id AS word_id,
        ls.text AS combination,
        icc.symbol,
        GREATEST(icc.count - lss.count, 0) AS remaining_count
    FROM language_string ls
    JOIN language_string_symbol lss ON ls.id = lss.string_id
    JOIN input_character_counts icc ON lss.symbol = icc.symbol
    GROUP BY ls.id
    HAVING SUM(CASE WHEN lss.count > icc.count THEN 1 ELSE 0 END) = 0 -- No character exceeds input count

    UNION ALL

    -- Recursive case: Add another word to the combination
    SELECT 
        a.word_id,
        CONCAT(a.combination, ' ', ls.text) AS combination,
        icc.symbol,
        GREATEST(a.remaining_count - lss.count, 0) AS remaining_count
    FROM anagrams a
    JOIN language_string ls ON ls.id != a.word_id -- Avoid using the same word twice
    JOIN language_string_symbol lss ON ls.id = lss.string_id
    JOIN input_character_counts icc ON lss.symbol = icc.symbol
    WHERE a.remaining_count > 0 -- Only continue if there are characters left
    GROUP BY ls.id
)
SELECT combination
FROM anagrams
WHERE remaining_count = 0; -- Only complete anagrams

That would return these if searching twelve plus one:

Eleven plus two
Eleven two plus
plus two eleven
plus eleven two
Two plus eleven
Two eleven plus

That doesn't appear to be that useful though, especially if you inputed "araceraceraceaaa", doesn't seem useful to return:

a race race race a a a
a race race race aa a
a race race race a aa
a race race race aaa
a racer ace race a a a
a racer ace race aa a
a racer ace race a aa
a racer ace race aaa
...

Or stuff like that.

Better to just match what is in the database, using the symbol counts for each string.

The first query will also return nested anagrams (i.e. words which are shorter than the input, but which contain a subset of the characters), not sure if you asked about that, but it's a common case.

An alternative approach I thought about for a SQL database was creating an anagram table which had a parent_id, so you can store a tree of simpler and simpler words, so as not to associate a long word like "racecars" with all possible substrings, just "racecars" goes to "racecar", and that goes to like "racer" and "care", and "care" goes to "car" and "are", etc..

But that wouldn't be able to paginate efficiently.

Another approach was instead of having parent_id, you would have a left and right decimal value, and use a "nested set pattern", sort of like this:

id	word	canonical	lft	rgt
1	racecars	aaccerrss	1	14
2	racecar	aaccerr	2	13
3	race	acer	3	6
4	a	a	4	5
5	care	acer	7	10
6	ace	ace	8	9
7	car	acr	11	12

Then you could select all descendents like:

SELECT word
FROM anagram
WHERE lft > 1 AND rgt < 14;

But that would be hard to update if you dynamically added new words to the database over time, you'd have to recompute your potentially large table's lft/rgt potentially on every new word inserted, which would be slow.

So having a "string symbol" table I think, like the first approach here, seems best, though it adds a lot of database records. But that is easy to scale.

Upvotes: 0

ACV

Reputation: 10560

And here is my novel solution.

Jon Bentley’s book Programming Pearls contains a problem about finding anagrams of words. The statement:

Given a dictionary of english words, find all sets of anagrams. For instance, “pots”, “stop” and “tops” are all anagrams of one another because each can be formed by permuting the letters of the others.

I thought a bit and it came to me that the solution would be to obtain the signature of the word you’re searching and comparing it with all the words in the dictionary. All anagrams of a word should have the same signature. But how to achieve this? My idea was to use the Fundamental Theorem of Arithmetic:

The fundamental theorem of arithmetic states that

every positive integer (except the number 1) can be represented in exactly one way apart from rearrangement as a product of one or more primes

So the idea is to use an array of the first 26 prime numbers. Then for each letter in the word we get the corresponding prime number A = 2, B = 3, C = 5, D = 7 … and then we calculate the product of our input word. Next we do this for each word in the dictionary and if a word matches our input word, then we add it to the resulting list.

The performance is more or less acceptable. For a dictionary of 479828 words, it takes 160 ms to get all anagrams. This is roughly 0.0003 ms / word, or 0.3 microsecond / word. Algorithm’s complexity seems to be O(mn) or ~O(m) where m is the size of the dictionary and n is the length of the input word.

Here’s the code:

package com.vvirlan;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.Scanner;

public class Words {
    private int[] PRIMES = new int[] { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73,
            79, 83, 89, 97, 101, 103, 107, 109, 113 };

    public static void main(String[] args) {
        Scanner s = new Scanner(System.in);
        String word = "hello";
        System.out.println("Please type a word:");
        if (s.hasNext()) {
            word = s.next();
        }
        Words w = new Words();
        w.start(word);
    }

    private void start(String word) {
        measureTime();
        char[] letters = word.toUpperCase().toCharArray();
        long searchProduct = calculateProduct(letters);
        System.out.println(searchProduct);
        try {
            findByProduct(searchProduct);
        } catch (Exception e) {
            e.printStackTrace();
        }
        measureTime();
        System.out.println(matchingWords);
        System.out.println("Total time: " + time);
    }

    private List<String> matchingWords = new ArrayList<>();

    private void findByProduct(long searchProduct) throws IOException {
        File f = new File("/usr/share/dict/words");
        FileReader fr = new FileReader(f);
        BufferedReader br = new BufferedReader(fr);
        String line = null;
        while ((line = br.readLine()) != null) {
            char[] letters = line.toUpperCase().toCharArray();
            long p = calculateProduct(letters);
            if (p == -1) {
                continue;
            }
            if (p == searchProduct) {
                matchingWords.add(line);
            }
        }
        br.close();
    }

    private long calculateProduct(char[] letters) {
        long result = 1L;
        for (char c : letters) {
            if (c < 65) {
                return -1;
            }
            int pos = c - 65;
            result *= PRIMES[pos];
        }
        return result;
    }

    private long time = 0L;

    private void measureTime() {
        long t = new Date().getTime();
        if (time == 0L) {
            time = t;
        } else {
            time = t - time;
        }
    }
}

Upvotes: 3

Parth

Reputation: 31

So here's the working solution, in Java, that Jason Cohen suggested and it performs somewhat better than the one using trie. Below are some of the main points:

Only load dictionary with the words that are subsets of given set of words
Dictionary will be a hash of sorted words as key and set of actual words as values (as suggested by Jason)
Iterate through each word from dictionary key and do a recursive forward lookup to see if any valid anagram is found for that key
Only do forward lookup because, anagrams for all the words that have already been traversed, should have already been found
Merge all the words associated to the keys for e.g. if 'enlist' is the word for which anagrams are to be found and one of the set of keys to merge are [ins] and [elt], and the actual words for key [ins] is [sin] and [ins], and for key [elt] is [let], then the final set of merge words would be [sin, let] and [ins, let] which will be part of our final anagrams list
Also to note that, this logic will only list unique set of words i.e. "eleven plus two" and "two plus eleven" would be same and only one of them would be listed in the output

Below is the main recursive code which finds the set of anagram keys:

// recursive function to find all the anagrams for charInventory characters
// starting with the word at dictionaryIndex in dictionary keyList
private Set<Set<String>> findAnagrams(int dictionaryIndex, char[] charInventory, List<String> keyList) {
    // terminating condition if no words are found
    if (dictionaryIndex >= keyList.size() || charInventory.length < minWordSize) {
        return null;
    }

    String searchWord = keyList.get(dictionaryIndex);
    char[] searchWordChars = searchWord.toCharArray();
    // this is where you find the anagrams for whole word
    if (AnagramSolverHelper.isEquivalent(searchWordChars, charInventory)) {
        Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
        Set<String> anagramSet = new HashSet<String>();
        anagramSet.add(searchWord);
        anagramsSet.add(anagramSet);

        return anagramsSet;
    }

    // this is where you find the anagrams with multiple words
    if (AnagramSolverHelper.isSubset(searchWordChars, charInventory)) {
        // update charInventory by removing the characters of the search
        // word as it is subset of characters for the anagram search word
        char[] newCharInventory = AnagramSolverHelper.setDifference(charInventory, searchWordChars);
        if (newCharInventory.length >= minWordSize) {
            Set<Set<String>> anagramsSet = new HashSet<Set<String>>();
            for (int index = dictionaryIndex + 1; index < keyList.size(); index++) {
                Set<Set<String>> searchWordAnagramsKeysSet = findAnagrams(index, newCharInventory, keyList);
                if (searchWordAnagramsKeysSet != null) {
                    Set<Set<String>> mergedSets = mergeWordToSets(searchWord, searchWordAnagramsKeysSet);
                    anagramsSet.addAll(mergedSets);
                }
            }
            return anagramsSet.isEmpty() ? null : anagramsSet;
        }
    }

    // no anagrams found for current word
    return null;
}

You can fork the repo from here and play with it. There are many optimizations that I might have missed. But the code works and does find all the anagrams.

Upvotes: 3

FogleBird

Reputation: 76842

Most of these answers are horribly inefficient and/or will only give one-word solutions (no spaces). My solution will handle any number of words and is very efficient.

What you want is a trie data structure. Here's a complete Python implementation. You just need a word list saved in a file named words.txt You can try the Scrabble dictionary word list here:

http://www.isc.ro/lists/twl06.zip

MIN_WORD_SIZE = 4 # min size of a word in the output

class Node(object):
    def __init__(self, letter='', final=False, depth=0):
        self.letter = letter
        self.final = final
        self.depth = depth
        self.children = {}
    def add(self, letters):
        node = self
        for index, letter in enumerate(letters):
            if letter not in node.children:
                node.children[letter] = Node(letter, index==len(letters)-1, index+1)
            node = node.children[letter]
    def anagram(self, letters):
        tiles = {}
        for letter in letters:
            tiles[letter] = tiles.get(letter, 0) + 1
        min_length = len(letters)
        return self._anagram(tiles, [], self, min_length)
    def _anagram(self, tiles, path, root, min_length):
        if self.final and self.depth >= MIN_WORD_SIZE:
            word = ''.join(path)
            length = len(word.replace(' ', ''))
            if length >= min_length:
                yield word
            path.append(' ')
            for word in root._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
        for letter, node in self.children.iteritems():
            count = tiles.get(letter, 0)
            if count == 0:
                continue
            tiles[letter] = count - 1
            path.append(letter)
            for word in node._anagram(tiles, path, root, min_length):
                yield word
            path.pop()
            tiles[letter] = count

def load_dictionary(path):
    result = Node()
    for line in open(path, 'r'):
        word = line.strip().lower()
        result.add(word)
    return result

def main():
    print 'Loading word list.'
    words = load_dictionary('words.txt')
    while True:
        letters = raw_input('Enter letters: ')
        letters = letters.lower()
        letters = letters.replace(' ', '')
        if not letters:
            break
        count = 0
        for word in words.anagram(letters):
            print word
            count += 1
        print '%d results.' % count

if __name__ == '__main__':
    main()

When you run the program, the words are loaded into a trie in memory. After that, just type in the letters you want to search with and it will print the results. It will only show results that use all of the input letters, nothing shorter.

It filters short words from the output, otherwise the number of results is huge. Feel free to tweak the MIN_WORD_SIZE setting. Keep in mind, just using "astronomers" as input gives 233,549 results if MIN_WORD_SIZE is 1. Perhaps you can find a shorter word list that only contains more common English words.

Also, the contraction "I'm" (from one of your examples) won't show up in the results unless you add "im" to the dictionary and set MIN_WORD_SIZE to 2.

The trick to getting multiple words is to jump back to the root node in the trie whenever you encounter a complete word in the search. Then you keep traversing the trie until all letters have been used.

Upvotes: 47

sanjiv

Reputation: 11

If I take a dictionary as a Hash Map as every word is unique and the Key is a binary(or Hex) representation of the word. Then if I have a word I can easily find the meaning of it with O(1) complexity.

Now, if we have to generate all the valid anagrams, we need to verify if the generated anagram is in the dictionary, if it is present in dictionary, its a valid one else we need to ignore that.

I will assume that there can be a word of max 100 characters(or more but there is a limit).

So any word we take it as a sequence of indexes like a word "hello" can be represented like "1234". Now the anagrams of "1234" are "1243", "1242" ..etc

The only thing we need to do is to store all such combinations of indexes for a particular number of characters. This is an one time task. And then words can be generated from the combinations by picking the characters from the index.Hence we get the anagrams.

To verify if the anagrams are valid or not, just index into the dictionary and validate.

The only thing need to be handled is the duplicates.That can be done easily. As an when we need to compare with the previous ones that has been searched in dictionary.

The solution emphasizes on performance.

Upvotes: 1

Jason Cohen

Reputation: 83071

For each word in the dictionary, sort the letters alphabetically. So "foobar" becomes "abfoor."

Then when the input anagram comes in, sort its letters too, then look it up. It's as fast as a hashtable lookup!

For multiple words, you could do combinations of the sorted letters, sorting as you go. Still much faster than generating all combinations.

(see comments for more optimizations and details)

Upvotes: 20

plinth

Reputation: 49199

One of the seminal works on programmatic anagrams was by Michael Morton (Mr. Machine Tool), using a tool called Ars Magna. Here is a light article based on his work.

Upvotes: 2

martinus

Reputation: 18023

A while ago I have written a blog post about how to quickly find two word anagrams. It works really fast: finding all 44 two-word anagrams for a word with a textfile of more than 300,000 words (4 Megabyte) takes only 0.6 seconds in a Ruby program.

Two Word Anagram Finder Algorithm (in Ruby)

It is possible to make the application faster when it is allowed to preprocess the wordlist into a large hash mapping from words sorted by letters to a list of words using these letters. This preprocessed data can be serialized and used from then on.

Upvotes: 0

dbkk

Reputation: 12842

As Jason suggested, prepare a dictionary making hashtable with key being word sorted alphabetically, and value word itself (you may have multiple values per key).
Remove whitespace and sort your query before looking it up.

After this, you'd need to do some sort of a recursive, exhaustive search. Pseudo code is very roughly:

function FindWords(solutionList, wordsSoFar, sortedQuery)
  // base case
  if sortedQuery is empty
     solutionList.Add(wordsSoFar)
     return

  // recursive case

  // InitialStrings("abc") is {"a","ab","abc"}
  foreach initialStr in InitalStrings(sortedQuery)
    // Remaining letters after initialStr
    sortedQueryRec := sortedQuery.Substring(initialStr.Length)
    words := words matching initialStr in the dictionary
    // Note that sometimes words list will be empty
    foreach word in words
      // Append should return a new list, not change wordSoFar
      wordsSoFarRec := Append(wordSoFar, word) 
      FindWords(solutionList, wordSoFarRec, sortedQueryRec)

In the end, you need to iterate through the solutionList, and print the words in each sublist with spaces between them. You might need to print all orderings for these cases (e.g. "I am Sam" and "Sam I am" are both solutions).

Of course, I didn't test this, and it's a brute force approach.

Upvotes: 0

user9282

Reputation: 678

The book Programming Pearls by Jon Bentley covers this kind of stuff quite nicely. A must-read.

Upvotes: 1

MartinJ

Reputation:

I've used the following way of computing anagrams a couple of month ago:

Compute a "code" for each word in your dictionary: Create a lookup-table from letters in the alphabet to prime numbers, e.g. starting with ['a', 2] and ending with ['z', 101]. As a pre-processing step compute the code for each word in your dictionary by looking up the prime number for each letter it consists of in the lookup-table and multiply them together. For later lookup create a multimap of codes to words.
Compute the code of your input word as outlined above.
Compute codeInDictionary % inputCode for each code in the multimap. If the result is 0, you've found an anagram and you can lookup the appropriate word. This also works for 2- or more-word anagrams as well.

Hope that was helpful.

Upvotes: 2

Tyler

Reputation: 28874

Pre-process:

Build a trie with each leaf as a known word, keyed in alphabetical order.

At search time:

Consider the input string as a multiset. Find the first sub-word by traversing the index trie as in a depth-first search. At each branch you can ask, is letter x in the remainder of my input? If you have a good multiset representation, this should be a constant time query (basically).

Once you have the first sub-word, you can keep the remainder multiset and treat it as a new input to find the rest of that anagram (if any exists).

Augment this procedure with memoization for faster look-ups on common remainder multisets.

This is pretty fast - each trie traversal is guaranteed to give an actual subword, and each traversal takes linear time in the length of the subword (and subwords are usually pretty darn small, by coding standards). However, if you really want something even faster, you could include all n-grams in your pre-process, where an n-gram is any string of n words in a row. Of course, if W = #words, then you'll jump from index size O(W) to O(W^n). Maybe n = 2 is realistic, depending on the size of your dictionary.

Upvotes: 5

hazzen

Reputation: 17486

See this assignment from the University of Washington CSE department.

Basically, you have a data structure that just has the counts of each letter in a word (an array works for ascii, upgrade to a map if you want unicode support). You can subtract two of these letter sets; if a count is negative, you know one word can't be an anagram of another.

Upvotes: 7

Jimmy

Reputation: 91502

How I see it:

you'd want to build a table that maps unordered sets of letters to lists words i.e. go through the dictionary so you'd wind up with, say

lettermap[set(a,e,d,f)] = { "deaf", "fade" }

then from your starting word, you find the set of letters:

 astronomers => (a,e,m,n,o,o,r,r,s,s,t)

then loop through all the partitions of that set ( this might be the most technical part, just generating all the possible partitions), and look up the words for that set of letters.

edit: hmmm, this is pretty much what Jason Cohen posted.

edit: furthermore, the comments on the question mention generating "good" anagrams, like the examples :). after you build your list of all possible anagrams, run them through WordNet and find ones that are semantically close to the original phrase :)

Upvotes: 1

Serafina Brocious

Reputation: 30619

Off the top of my head, the solution that makes the most sense would be to pick a letter out of the input string randomly and filter the dictionary based on words that start with that. Then pick another, filter on the second letter, etc. In addition, filter out words that can't be made with the remaining text. Then when you hit the end of a word, insert a space and start it over with the remaining letters. You might also restrict words based on word type (e.g. you wouldn't have two verbs next to each other, you wouldn't have two articles next to each other, etc).

Upvotes: 0

Algorithm to generate anagrams

Answers (15)

Related Questions