Albert Hendriks
Albert Hendriks

Reputation: 2143

count distinct values in big long array (performance issue)

I have this:

long hnds[] = new long[133784560]; // 133 million

Then I quickly fill the array (couple of ms) and then I somehow want to know the number of unique (i.e. distinct) values. Now, I don't even need this realtime, I just need to try out a couple of variations and see how many unique values each gives.

I tried e.g. this:

import org.apache.commons.lang3.ArrayUtils;
....
HashSet<Long> length = new HashSet<Long>(Arrays.asList(ArrayUtils.toObject(hnds)));
System.out.println("size: " + length.size());

and after waiting for half an hour it gives a heap space error (I have Xmx4000m).

I also tried initializing Long[] hnds instead of long[] hnds, but then the initial filling of the array takes forever. Or for example use a Set from the beginning when adding the values, but also then it takes forever. Is there any way to count the distinct values of a long[] array without waiting forever? I'd write it to a file if I have to, just some way.

Upvotes: 1

Views: 301

Answers (2)

Dima
Dima

Reputation: 40510

Just sort it and count.

 int sz = 133784560;
 Random randy = new Random();
 long[] longs = new long[sz];
 for(int i = 0; i < sz; i++) { longs[i] = randy.nextInt(10000000); }
 Arrays.sort(longs);
 long lastSeen = longs[0];
 long count = 0;
 for(int i = 1; i < sz; i++) {
   if(longs[i] != lastSeen) count++;
   lastSeen = longs[i];
 }

Takes about 15 seconds on my laptop.

Upvotes: 1

Louis Wasserman
Louis Wasserman

Reputation: 198211

My best suggestion would be to use a library like fastutil (http://fastutil.di.unimi.it/) and then use the custom unboxed hash set:

import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
System.out.println(new LongOpenHashSet(hnds).size());

(Also, by the way, if you can accept approximate answers, there are much more efficient algorithms you can try; see e.g. this paper for details.)

Upvotes: 3

Related Questions