Rajeev
Rajeev

Reputation: 5002

Generate same unique hash code for all anagrams

Recently, I attended an interview and faced a good question regarding hash collisions.

Question : Given a list of strings, print out the anagrams together.

Example :

i/p :              {act, god, animal, dog, cat}

o/p :                 act, cat, dog, god


I want to create hashmap and put the word as key and value as list of anagrams

To avoid collision, I want to generate unique hash code for anagrams instead of sorting and using the sorted word as key.

I am looking for hash algorithm which take care of collision other than using chaining. I want algorithm to generate same hash code for both act and cat... so that it will add next word to the value list

Can anyone suggest a good algorithm ?

Upvotes: 19

Views: 24980

Answers (8)

Pankaj Dwivedi
Pankaj Dwivedi

Reputation: 159

We can use the binary value representation of array. This code snippet is assuming all characters are small latin characters.

public int hashCode() {
    //TODO: so that each set of anagram generates same hashCode
    int sLen = s.length();
    int [] ref = new int[26];
    for(int i=0; i< sLen; i++) {
      ref[s.charAt(i) - 'a'] +=1;
    }
    int hashCode = 0;
    for(int i= 0; i < ref.length; i++) {
      hashCode += new Double(Math.pow(2, i)).intValue() * ref[i];
    }
    return hashCode;
}

Upvotes: 0

Brijith B
Brijith B

Reputation: 1

create the hascode in following way

String hash(String s){
    char[] hashValue = new char[26];
    for(char c : s.toCharArray()){
        hash[c-'a']++;
    }
    return new String(hashValue);
}

here the hash will be initialized with the default value of char u0000 and an increment will make the value to the next Unicode. since its a char array we can convert it to string and use it as the key

Upvotes: 0

SRobertJames
SRobertJames

Reputation: 9253

The complexity above seems very misplaced! You don't need prime numbers or hashes. It's just three simple ops:

  1. Map each OriginalWord to a Tuple of (SortedWord, OriginalWord). Example: "cat" becomes ("act", "cat"); "dog" becomes ("dgo", "dog"). This is a simple sort on the chars of each OriginalWord.
  2. Sort the Tuples by their first element. Example: ("dgo", "dog"), ("act, "cat") sorts to ("act", "cat"), ("dgo", "dog"). This is a simple sort on the entire collection.
  3. Iterate through the tuples (in order), emitting the OriginalWord. Example: ("act", "cat"), ("dgo", "dog") emits "cat" "dog". This is a simple iteration.

Two iterations and two sorts are all it takes!

In Scala, it's exactly one line of code:

val words = List("act", "animal", "dog", "cat", "elvis", "lead", "deal", "lives", "flea", "silent", "leaf", "listen")

words.map(w => (w.toList.sorted.mkString, w)).sorted.map(_._2)
# Returns: List(animal, act, cat, deal, lead, flea, leaf, dog, listen, silent, elvis, lives)

Or, as the original question implies, you only want cases where the count > 1, it's just a bit more:

scala> words.map(w => (w.toList.sorted.mkString, w)).groupBy(_._1).filter({case (k,v) => v.size > 1}).mapValues(_.map(_._2)).values.toList.sortBy(_.head)
res64: List[List[String]] = List(List(act, cat), List(elvis, lives), List(flea, leaf), List(lead, deal), List(silent, listen))

Upvotes: 2

Paul Chernoch
Paul Chernoch

Reputation: 5553

The other posters suggested converting characters into prime numbers and multiplying them together. If you do this modulo a large prime, you get a good hash function that won't overflow. I tested the following Ruby code against the Unix word list of most English words and found no hash collisions between words that are not anagrams of one another. (On MAC OS X, this file is located here: /usr/share/dict/words.)

My word_hash function takes the ordinal value of each character mod 32. This will make sure that uppercase and lowercase letters have the same code. The large prime I use is 2^58 - 27. Any large prime will do so long as it is less than 2^64 / A where A is my alphabet size. I am using 32 as my alphabet size, so this means I can't use a number larger than about 2^59 - 1. Since ruby uses one bit for sign and a second bit to indicate if the value is a number or an object, I lose a bit over other languages.

def word_hash(w)
  # 32 prime numbers so we can use x.ord % 32. Doing this, 'A' and 'a' get the same hash value, 'B' matches 'b', etc for all the upper and lower cased characters.
  # Punctuation gets assigned values that overlap the letters, but we don't care about that much.
  primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103,107,109,113,127,131]
  # Use a large prime number as modulus. It must be small enough so that it will not overflow if multiplied by 32 (2^5). 2^64 / 2^5 equals 2^59, so we go a little lower.
  prime_modulus = (1 << 58) - 27
  h = w.chars.reduce(1) { |memo,letter| memo * primes[letter.ord % 32] % prime_modulus; }
end

words = (IO.readlines "/usr/share/dict/words").map{|word| word.downcase.chomp}.uniq
wordcount = words.size
anagramcount = words.map { |w| w.chars.sort.join }.uniq.count

whash = {}
inverse_hash = {}
words.each do |w|
  h = word_hash(w)
  whash[w] = h
  x = inverse_hash[h]
  if x && x.each_char.sort.join != w.each_char.sort.join
    puts "Collision between #{w} and #{x}"
  else
    inverse_hash[h] = w
  end
end
hashcount = whash.values.uniq.size
puts "Unique words (ignoring capitalization) = #{wordcount}. Unique anagrams = #{anagramcount}. Unique hash values = #{hashcount}."

Upvotes: 3

Leeor
Leeor

Reputation: 19716

Hashing with the sorted string is pretty nice, i'd have done that probably, but it could indeed be slow and cumbersome. Here's another thought, not sure if it works - pick a set of prime numbers, as small as you like, the same size as your character set, and build a fast mapping function from your chars to that. Then for a given word, map each character into the matching prime, and multiply. finally, hash using the result.

This is very similar to what Heuster suggested, only with less collisions (actually, I believe there will be no false collisions, given the uniqueness of the prime decomposition of any number).

simple e.g. -

int primes[] = {2, 3, 5, 7, ...} // can be auto generated with a simple code

inline int prime_map(char c) {
    // check c is in legal char set bounds
    return primes[c - first_char];
}

...
char* word = get_next_word();
char* ptr = word;
int key = 1;
while (*ptr != NULL) {
    key *= prime_map(*ptr);
    ptr++;
}
hash[key].add_to_list(word);

[edit]

A few words about the uniqueness - any integer number has a single breakdown to multiplications of primes, so given an integer key in the hash you can actually reconstruct all possible strings that would hash to it, and only these words. Just break into primes, p1^n1 * p2^n2 * ... and convert each prime to the matching char. the char for p1 would appear n1 times, and so on. You can't get any new prime you didn't explicitly used, being prime means you can't get it by any multiplication of other primes.

This brings another possible improvement - if you can construct the string, you just need to mark the permutations you saw when populating the hash. since the permutations can be ordered by lexicographic order, you can replace each one with a number. This would save the space of storing the actual strings in the hash, but would require more computations so it's not necessarily a good design choice. Still, it makes a nice complication of the original question for interviews :)

Upvotes: 32

Saakshaat Singh
Saakshaat Singh

Reputation: 5

The solution using product of primes is brilliant and here's a Java implementation incase anyone needs one.

class HashUtility {
    private int n;
    private Map<Character, Integer> primeMap;

    public HashUtility(int n) {
        this.n = n;
        this.primeMap = new HashMap<>();
        constructPrimeMap();
    }

    /**
     * Utility to check if the passed {@code number} is a prime.
     *
     * @param number The number which is checked to be prime.
     * @return {@link boolean} value representing the prime nature of the number.
     */
    private boolean isPrime(int number) {
        if (number <= 2)
            return number == 2;
        else
            return (number % 2) != 0
                    &&
                    IntStream.rangeClosed(3, (int) Math.sqrt(number))
                            .filter(n -> n % 2 != 0)
                            .noneMatch(n -> (number % n == 0));
    }

    /**
     * Maps all first {@code n} primes to the letters of the given language.
     */
    private void constructPrimeMap() {
        List<Integer> primes = IntStream.range(2, Integer.MAX_VALUE)
                .filter(this::isPrime)
                .limit(this.n)      //Limit the number of primes here
                .boxed()
                .collect(Collectors.toList());

        int curAlphabet = 0;
        for (int i : primes) {
            this.primeMap.put((char) ('a' + curAlphabet++), i);
        }
    }

    /**
     * We calculate the hashcode of a word by calculating the product of each character mapping prime. This works since
     * the product of 2 primes is unique from the products of any other primes.
     * <p>
     * Since the hashcode can be huge, we return it modulo a large prime.
     *
     * @param word The {@link String} to be hashed.
     * @return {@link int} representing the prime hashcode associated with the {@code word}
     */
    public int hashCode(String word) {
        long primeProduct = 1;
        long mod = 100000007;
        for (char currentCharacter : word.toCharArray()) {
            primeProduct *= this.primeMap.get(currentCharacter) % mod;
        }

        return (int) primeProduct;
    }
}

Please let me know if/how I can improve this.

Upvotes: 0

Utsav Chokshi
Utsav Chokshi

Reputation: 1395

Small practical Optimization , I would suggest for the above hash method is :

Assign least prime number to vowels and then most frequently occurring consonants. Ex : e : 2 a : 3 i : 5 o : 7 u : 11 t : 13 and so on...

Also, average word length for english is : ~ 6

Also, top 26 prime numbers are less than 100 [2,3,5,7, .. , 97]

Hence, on average your hash would generate value around 100^6 = 10^12.

So there are very less chances of collision if you take prime number for modulo bigger than 10^12.

Upvotes: 3

Rajeev
Rajeev

Reputation: 5002

Hash function : Assign primary numbers to each character. While calculating hash code, get the prime number assigned to that character and multiply with to existing value.Now all anagrams produce same hash value.

ex : a - 2, c - 3 t - 7

hash value of cat = 3*2*7 = 42 hash value of act = 2*3*7 = 42 Print all strings which are having same hash value(anagrams will have same hash value)

Upvotes: 6

Related Questions