umop apisdn
umop apisdn

Reputation: 683

Potential hashing problems when using hashCode and Arrays.equals

As the comment in my code explains, the task is to find the number of pairs of strings from a given input file which are permutations of each other. For example, "ABCD" and "BCDA" are permutations of each other, meaning that a pair has been found.

The main bulk of my program is then as follow:

/**
 * Finds the number of pairs of strings that are permutations of each other.
 * 
 * A hash map is created with a hash code generated from the array formed using the getFrequency
 * method as key and a pair containing a string array and the number of times a permutation of that 
 * particular string array has been found as value.
 * 
 * If a permutation is already in the hash table previously, increment the counter.
 */
public static int findPairs(String fileName) {
    try {
        //Sets up the necessary file readers
        FileReader dataFile = new FileReader(fileName);
        BufferedReader bufferedDataFile = new BufferedReader(dataFile);

        String line = bufferedDataFile.readLine();

        //Finds the number of entries in the file
        int num = Integer.parseInt(line);

        int counter = 0;
        int accumulator = 0;

        HashMap<Integer, Pair> store = new HashMap<>();

        for (int i = 0; i < num; i++) {
            String current = bufferedDataFile.readLine();
            int[] currentArr = getFrequency(current);
            int currHashCode = Arrays.hashCode(currentArr);

            if (store.containsKey(currHashCode)) {
                Pair pairToCheck = store.get(currHashCode);
                int[] arrToCheck = pairToCheck.getArr();

                //Double checking, in case there is a collision and unequal arrays 
                //have the same hashCode
                if (Arrays.equals(currentArr, arrToCheck)) {
                    counter = pairToCheck.getCount();
                    pairToCheck.updateCount();
                } else {
                    //if the current bucket is not empty, and not a permutation of the input string,
                    //continue to conduct a linear  probe
                    while (pairToCheck != null && !Arrays.equals(currentArr, arrToCheck)) {
                        currHashCode++;
                        pairToCheck = store.get(currHashCode);
                        arrToCheck = pairToCheck.getArr();
                    }

                    //if the current bucket is empty, add the new pair into the position
                    if (pairToCheck == null) {
                        counter = 0;
                    //otherwise, a permutation has been found later in the linear probe!
                    } else {
                        counter = pairToCheck.getCount();
                        pairToCheck.updateCount();
                    }
                }
            //no such permutation in the hash table yet!    
            } else {
                counter = 0;
            }

            //Updates the accumulator using the counter. If there were already other strings
            //which are permutations of the current string, the current string will be able to
            //form a pair with each of these strings.
            accumulator += counter;

            //Updates the hash map only if the permutation has not been stored previously
            if (counter == 0) {
                Pair newPair = new Pair(currentArr, 1);
                store.put(currHashCode, newPair);
            }
        }

        //Close the file reader
        bufferedDataFile.close();

        return accumulator;
    } catch (Exception e) {
        System.out.println(e);
    }

    //In the event of an error, return -1
    return -1;
}

What are some potential problems which can result from such manipulation of Java's hashCode and Arrays implementations? This is particularly because I have been given some private test cases to pass, and while I can pass a number of them, there's one which I repeatedly fail. I suspect it has to do with the way I am dealing with collisions... But although I have inspected this multiple times, I am still uncertain where the error might possibly lie. Any help is much appreciated!

EDIT: As per request, here is my getFrequency method:

public static int[] getFrequency(String s) {
    //There are 128 legal ascii characters
    int[] charArr = new int[128];

    //Iterate through the given string, and increment the count for a character using its 
    //ascii value to locate its position in the array
    for (int i = 0; i < s.length(); i++) {

        char c = s.charAt(i);
        int ascii = (int) c;
        charArr[ascii] += 1;    
    }

    return charArr;
}

EDIT 2: And Pair:

public class Pair {

   private int[] m_arr;
   private int m_count;

   public Pair(int[] arr, int count) {
       this.m_arr = arr;
       this.m_count = count;
   }

   public int[] getArr() {
       return this.m_arr;
   }

   public int getCount() {
       return this.m_count;
   }

   public void updateCount() {
       this.m_count++;
   }

}

Upvotes: 1

Views: 163

Answers (1)

rossum
rossum

Reputation: 15693

Finding anagrams is a known problem. The usual solution is to sort the strings and compare sorted strings. When you sort, "ABCD" and "BCDA" both become "ABCD".

Storing the sorted strings in a set will let you find matches easily. Make a class that keeps the string in its sorted and unsorted versions separately for easy retrieval of the unsorted version of the string.

Your hash function is not good, since "BB" will hash to the same value as "AC". Use a better hash function on the sorted version of the string.

Upvotes: 2

Related Questions