Baptiste Merliot

Reputation: 861

Counting each distinct array occurrence in a list of arrays with duplicates

PROBLEM

I have a list of arrays and I want to count the occurrences of duplicates.

For example, if I have this:

{{1,2,3},
 {1,0,3},
 {1,2,3},
 {5,2,6},
 {5,2,6},
 {5,2,6}}

I want a map (or any other suitable collection) like this:

{ {1,2,3} -> 2,
  {1,0,3} -> 1,
  {5,2,6} -> 3 }

I can even lose the array values; I'm only interested in the counts (e.g. 2, 1 and 3 here).

MY SOLUTION

I use the following algorithm. Here is the code; the returned boolean tells whether a new distinct array was found. I apply this function sequentially to all of my arrays:

    HashMap<int[], Long> distinctElements;
    HashMap<Integer, ArrayList<int[]>> distinctHash;

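    private boolean addRow(int[] row) {
        // `hash` is the int hash of `row` (murmur3_128 via Google Guava);
        // its computation is not shown in this snippet.
        if (distinctHash.containsKey(hash)) {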
            for (int[] previousRow: distinctHash.get(hash)) {
                if (Arrays.equals(previousRow, row)) {
                    // Increment the count of the matching row itself; the first
                    // row in the bucket may be a different array with the same hash.
                    distinctElements.put(
                            previousRow,
                            distinctElements.get(previousRow) + 1
                    );
                    return false;
                }
            }
            distinctElements.put(row, 1L);

            ArrayList<int[]> rowList = distinctHash.get(hash);
            rowList.add(row);
            distinctHash.put(hash, rowList);

            return true;

        } else {
            distinctElements.put(row, 1L);

            ArrayList<int[]> newValue = new ArrayList<>();
            newValue.add(row);
            distinctHash.put(hash, newValue);

            return true;
        }
    }

QUESTION

The problem is that my algorithm is too slow for my needs (40 s for 5,000,000 arrays, and 2-3 h for 20,000,000 arrays). Profiling with NetBeans showed that hashing takes 70% of the runtime (using the Google Guava murmur3_128 hash function).

Is there another algorithm that could be faster? As I said, I'm not interested in the array values, only in the number of their occurrences. I am ready to sacrifice precision for speed, so a probabilistic algorithm is fine.

Upvotes: 2

Views: 95

Answers (3)

Andreas

Reputation: 159185

Wrap the int[] in a class that implements equals and hashCode, then build a Map from the wrapper class to the instance count.

import java.util.Arrays;

class IntArray {
    private int[] array;
    public IntArray(int[] array) {
        this.array = array;
    }
    @Override
    public int hashCode() {
        return Arrays.hashCode(this.array);
    }
    @Override
    public boolean equals(Object obj) {
        return (obj instanceof IntArray && Arrays.equals(this.array, ((IntArray) obj).array));
    }
    @Override
    public String toString() {
        return Arrays.toString(this.array);
    }
}

Test

int[][] input = {{1,2,3},
                 {1,0,3},
                 {1,2,3},
                 {5,2,6},
                 {5,2,6},
                 {5,2,6}};
Map<IntArray, Long> map = Arrays.stream(input).map(IntArray::new)
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
map.entrySet().forEach(System.out::println);

Output

[1, 2, 3]=2
[1, 0, 3]=1
[5, 2, 6]=3

Note: The above solution is faster and uses less memory than the solution by Ravindra Ranwala, but it requires an extra class, so which one is better is debatable.

For smaller arrays, use the simpler solution below by Ravindra Ranwala.
For larger arrays, the above solution is likely better.

 Map<List<Integer>, Long> map = Stream.of(input)
         .map(a -> Arrays.stream(a).boxed().collect(Collectors.toList()))
         .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
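
If the extra class is the main objection, one possible middle ground (not part of either answer, so treat it as a sketch) is to use java.nio.IntBuffer as the key. IntBuffer.wrap does not copy or box the array, and IntBuffer defines equals and hashCode in terms of the buffer's remaining elements. The class name IntBufferCount and the method name count are made up for this illustration. As with the wrapper class, the arrays must not be modified while they are used as keys.

```java
import java.nio.IntBuffer;
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class IntBufferCount {

    // Groups rows by content: IntBuffer.wrap shares the array (no copy, no boxing)
    // and IntBuffer compares and hashes by its remaining elements.
    static Map<IntBuffer, Long> count(int[][] rows) {
        return Arrays.stream(rows)
                .map(IntBuffer::wrap)
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        int[][] input = {{1,2,3},
                         {1,0,3},
                         {1,2,3},
                         {5,2,6},
                         {5,2,6},
                         {5,2,6}};
        // key.array() returns the original int[] backing the wrapped buffer.
        count(input).forEach((key, n) ->
                System.out.println(Arrays.toString(key.array()) + "=" + n));
    }
}
```

The iteration order of the resulting map is unspecified, so the lines may print in any order.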

Upvotes: 4

OmG

Reputation: 18838

If all duplicates of an array have their elements in the same order and the arrays are short, you can map each array to a single integer and use that integer as the key in the last part of your method. Although this reduces the time spent hashing, it relies on assumptions that might not hold in your case.
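
A minimal sketch of this idea, under assumptions that go beyond what the answer states: every row has the same length, every value is non-negative and fits in bitsPerValue bits, and the whole row fits in 64 bits (so a long is used here instead of an int). The names PackedRowCounter, pack and count are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class PackedRowCounter {

    // Hypothetical helper: packs a short row of small non-negative values into one long.
    // Assumes all rows share the same length; otherwise distinct rows could collide.
    static long pack(int[] row, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32
                || (long) row.length * bitsPerValue > 64) {
            throw new IllegalArgumentException("row does not fit in 64 bits");
        }
        long maxValue = (1L << bitsPerValue) - 1;
        long key = 0L;
        for (int value : row) {
            if (value < 0 || value > maxValue) {
                throw new IllegalArgumentException("value out of range: " + value);
            }
            // Shift the previous values up and append the next one.
            key = (key << bitsPerValue) | value;
        }
        return key;
    }

    // Counts rows by their packed key; the original arrays are not retained.
    static Map<Long, Long> count(int[][] rows, int bitsPerValue) {
        Map<Long, Long> counts = new HashMap<>();
        for (int[] row : rows) {
            counts.merge(pack(row, bitsPerValue), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        int[][] input = {{1,2,3}, {1,0,3}, {1,2,3}, {5,2,6}, {5,2,6}, {5,2,6}};
        // Only the counts are of interest, so print the values of the map.
        System.out.println(count(input, 8).values());
    }
}
```

When the assumptions hold, the packing is exact and no two distinct rows share a key. When they do not hold, pack throws rather than silently producing colliding keys.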

Upvotes: 0

Ravindra Ranwala

Reputation: 21124

You can do it like this:

Map<List<Integer>, Long> result = Stream.of(source)
        .map(a -> Arrays.stream(a).boxed().collect(Collectors.toList()))
        .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

And here's the output:

{[1, 2, 3]=2, [1, 0, 3]=1, [5, 2, 6]=3}

Upvotes: 3
