sonoerin
sonoerin

Reputation: 5175

Java Anagram running out of memory

I am trying to solve the age old anagram problem. Thanks to many tutorials out there, I am able to iterate through a Set of Strings, recursively find all permutations, then compare them against a list of English words. The problem I am finding is that after about three words (usually on something like "anamorphosis"), I get a OutOfMemory error. I tried breaking my batches into small sets because it appears to be the recursive part consuming all my memory. But even just "anamorphosis" locks it up...

Here I read the words from a file into a List

Scanner scanner = new Scanner(resource.getInputStream());
   while (scanner.hasNext()) {
       String s = scanner.nextLine();
        uniqueWords.add(s.toLowerCase());
   }

Now I break them into smaller sets and call a class to generate anagrams:

List<List<String>> subSets = Lists.partition(new ArrayList(uniqueWords), SET_SIZE);

for (List<String> set: subSets) {
      // tried created as class attribute & injection, no difference 
      AnagramGenerator anagramGenerator = new AnagramGenerator();
      List<Word> anagrams = anagramGenerator.createWordList(set);
      wordsRepository.save(anagrams);
      LOGGER.info("Inserted {} records into the database", anagrams.size());
 }

And finally my generator:

public class AnagramGenerator {

private Map<String, List<String>> map = new Hashtable<>();
public List<Word> createWordList(List<String> dictionary) {

   buildAnagrams(dictionary);

   List<Word> words = new ArrayList<>();
   for (Map.Entry<String, List<String>> entry : map.entrySet()) {
       words.add(new Word(entry.getKey(), entry.getValue()));
   }
    return words;
   }

private Map<String, List<String>> buildAnagrams(List<String> dictionary) {

        for (String str : dictionary) {
            String key = sortString(str);
            if (map.get(key) != null) {
                map.get(key).add(str.toLowerCase());
            } else {
                if (str.length() < 2) {
                    map.put(key, new ArrayList<>());
                } else {
                    Set<String> permutations = permutations(str);
                    Set<String> anagramList = new HashSet<>();

                    for (String temp : permutations) {
                        if (dictionary.contains(temp) && !temp.equalsIgnoreCase(str)) {
                            anagramList.add(temp);
                        }
                    }
                    map.put(key, new ArrayList<>(anagramList));
                }
            }
        }
        return map;
    }

   private Set<String> permutations(String str) {    
        if (str.isEmpty()) {
            return Collections.singleton(str);
        } else {
            Set<String> set = new HashSet<>();
            for (int i = 0; i < str.length(); i++)
                for (String s : permutations(str.substring(0, i) + str.substring(i + 1)))
                    set.add(str.charAt(i) + s);
            return set;
        }
    }

Edit: Based upon the excellent feedback I have changed my generator from a permutations to a work lookup:

public class AnagramGenerator {
private Map<String, Set<String>> groupedByAnagram = new HashMap<String, Set<String>>();

    private Set<String> dictionary;

    public AnagramGenerator(Set<String> dictionary) {

        this.dictionary = dictionary;
    }

 public List<Word> searchAlphabetically() {

        List<Word> words = new ArrayList<>();
        for (String word : dictionary) {
            String key = sortString(word);
            if (!groupedByAnagram.containsKey(key)) {
                groupedByAnagram.put(key, new HashSet<>());
            }
            if (!word.equalsIgnoreCase(key)) {
                groupedByAnagram.get(key).add(word);
            }
        }

        for (Map.Entry<String, Set<String>> entry : groupedByAnagram.entrySet()) {
            words.add(new Word(entry.getKey(), new ArrayList(entry.getValue())));
        }

        return words;
    }
 private String sortString(String goodString) {

        char[] letters = goodString.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }

It has a bit more tweaking so I don't add a word as it's own anagram, but otherwise this appears to be blazing fast. And, the code is much cleaner. Thanks everyone!

Upvotes: 1

Views: 342

Answers (3)

Ghislain Fourny
Ghislain Fourny

Reputation: 7279

Here is an answer that combines slim's approach with mine, "Pseudo-Java code":

Map<String, Set<String>> groupedByAnagram = new HashMap<String, Set<String>>();

for(String word: dictionary)
{
  String footprint = sort_alphabetically(word);
  if(!groupedByAnagram.contains(footprint))
  {
    groupedByAnagram.put(footprint, new HashSet<String>>());
  }
  groupedByAnagram.get(footprint).insert(word); 
}

for(Set<String> anagram: groupedByAnagram.values())
{
  if(anagram.size() > 1)
  {
    System.out.println("Anagram found.");
    for (String word: anagram)
    {
      System.out.println(word);
    }
  } 
}

It first builds an index of all words by "anagram fingerprint" (slim's idea) and then goes through it, only outputting entries with more than one word.

Upvotes: 1

slim
slim

Reputation: 41213

As noted for longer words, the number of permutations soon becomes enormous.

/usr/share/dict/british-english on Debian has 99,156 lines. There are longer word lists, but let's use that as an example.

The number of permutations for a nine letter word is 9! = 362,880

Therefore, for words of 9 letters or more, it's less computational effort to try every word in the dictionary, than it is to try every permutation of the input word.

10! milliseconds = ~1 hour
12! milliseconds = ~5.54 days
15! milliseconds = ~41.44 years

And you'd be lucky to process one permutation per ms, so you can see you soon get to a number of permutations that are completely impractical to work with. The impact on stack and heap mounts up at the same rate.

So, try the algorithm (pseudocode):

 sorted_input = sort_alphabetically(input_word)
 for each dictionary_word // probably a file readline()
     sorted_dictionary_word = sort_alphabetically(dictionary_word)
     if(sorted_dictionary_word = sorted_input)
         it's an anagram! Handle it
     end 
 end

Similarly, you can fairly quickly write all dictionary-word algorithms into a lookup data structure. Pseudocode again; in Java you could use a Map<String, List<String>> or a MultiMap from Apache Commons or Guava:

  multimap = new MultiMap<String, String> // or whatever

  def build_dict:
      for each dictionary_word // probably a file readline()
          multimap.add(
               sort_alphabetically(dictionary_word), 
               dictionary_word)
      end
  end

  def lookup_anagrams(word):
      return multimap.get(sort_alphabetically(word))
  end 

This takes up a moderate amount of memory (the whole dictionary, plus a bit for the keys and map overheads), but means that once the structure is created, you can query over and over again very cheaply indeed.

If you want to find two-word anagrams, you will need a more complicated and interesting algorithm. But even then, avoiding brute-forcing the entire search-space of permutations is vital to your success.

Upvotes: 4

Ghislain Fourny
Ghislain Fourny

Reputation: 7279

Doing a quick calculation: "anamorphosis" has 12 letters, which gives 12! = 479,001,600 permutations. Each string takes at least 12 bytes (assuming, say, UTF-8 with ASCII characters only), meaning a total size of 12 * 479,001,600 bytes, which is roughly 6 GB.

Now, the default heap size is, as far as I know, set to 1GB or (if smaller) one quarter of available memory. This is less than the required 6GB.

There are two ways out of this:

  • increase heap size when executing the program, but it will not work for longer words as permutations grow exponentially: with just one more letter, "accomplishing" already requires 78GB.

  • streaming through permutations rather than materializing them into a set of strings. Concretely, this means still using a recursion, but instead of storing each recursively generated permutation, it is processed immediately, then forgotten when moving on to the next one.

Now, if it needs to be done against the entire dictionary, another approach if you have access to a cluster, could be to compute the cartesian product of the dictionary with itself, store it on a distributed file system like HDFS (should be in the order of magnitude of a billion entries), then go through all pairs in parallel using MapReduce and output the pairs that are anagrams from one another. It's more efforts, but the complexity goes down from exponential in the length of the words, to quadratic in the size of the dictionary.

Upvotes: 1

Related Questions