Reputation: 8000
I am trying to count the unique values in a process where the values are fetched from a remote source.
There can be millions of values.
I am aware of using a HashSet
to get the unique count, but it takes too much memory.
Sample code:
long beforeUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Set<String> hashSet = new HashSet<>();
for (int index = 0; index < 1000000; index++) {
hashSet.add(UUID.randomUUID().toString());
}
long afterUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
long actualMemUsed = afterUsedMem - beforeUsedMem;
System.out.println("Items " + hashSet.size());
System.out.println("Mem used: " + actualMemUsed / (1024 * 1024) + " MB");
For 1 million unique strings, the HashSet takes around 240 MB of RAM.
I can't use a DB to store these values, so querying a database for the unique count is out of the question.
Is there any other way to get the count of unique values?
Upvotes: 0
Views: 490
Reputation: 492
If you are really interested in memory savings and can accept a small error, check out algorithms for the count-distinct problem.
The best-known example is HyperLogLog, which can use a few kilobytes of memory to count millions of results with a low error margin (from Wikipedia: 1.5 kB of memory for a 2% error margin on 10^9 results).
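To make the idea concrete, here is a minimal sketch of a HyperLogLog counter, not a tuned implementation. The class name `HyperLogLogSketch`, the choice of `p = 14` index bits (16384 one-byte registers, about 16 KB, with a theoretical standard error of roughly 1.04 / sqrt(16384), i.e. about 0.8%), and the hash function (FNV-1a followed by the MurmurHash3 64-bit finalizer) are all assumptions made for illustration.

```java
import java.util.UUID;

final class HyperLogLogSketch {
    private final int p;            // number of bits used to pick a register
    private final int m;            // number of registers, 2^p
    private final byte[] registers; // each register keeps the largest rank seen

    HyperLogLogSketch(int p) {
        if (p < 7 || p > 18) throw new IllegalArgumentException("p must be in [7, 18]");
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // Illustrative 64-bit hash (an assumption, not part of the algorithm itself):
    // FNV-1a over the chars, then the MurmurHash3 fmix64 finalizer for better mixing.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    void add(String value) {
        long h = hash64(value);
        int index = (int) (h >>> (64 - p)); // top p bits select the register
        long rest = h << p;                 // remaining bits determine the rank
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[index]) registers[index] = (byte) rank;
    }

    long estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += 1.0 / (1L << r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * (double) m / sum;
        // Small-range correction: fall back to linear counting while registers are still empty.
        if (raw <= 2.5 * m && zeros > 0) {
            return Math.round(m * Math.log((double) m / zeros));
        }
        return Math.round(raw);
    }

    public static void main(String[] args) {
        HyperLogLogSketch hll = new HyperLogLogSketch(14);
        for (int index = 0; index < 1000000; index++) {
            hll.add(UUID.randomUUID().toString());
        }
        System.out.println("Estimated items: " + hll.estimate());
    }
}
```

The memory footprint is fixed by `p` and does not grow with the number of values, which is the main difference from the HashSet approach. In production, a maintained library implementation would likely be a safer choice than hand-rolled code like this.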
Upvotes: 2
Reputation: 71
Here is my solution:
I created a hash object that can handle a large number of unique hash codes:
public class StringHash implements Comparable<StringHash> {
private final int length;
private final int hashcode;
private final long upper;
private final long lower;
public StringHash(String value) {
this.length = value.length();
long upperTemp = 0, lowerTemp = 0;
for (int i = 0; i < length; ++i) {
char c = value.charAt(i);
upperTemp = 255 * upperTemp + c;
lowerTemp = 127 * lowerTemp + c;
}
this.upper = upperTemp;
this.lower = lowerTemp;
this.hashcode = value.hashCode();
}
@Override
public int hashCode() {
return hashcode;
}
@Override
public int compareTo(StringHash o) {
if (hashcode != o.hashcode) return Integer.compare(hashcode, o.hashcode);
if (length != o.length) return Integer.compare(length, o.length);
if (upper != o.upper) return Long.compare(upper, o.upper);
if (lower != o.lower) return Long.compare(lower, o.lower);
return 0;
}
@Override
public boolean equals(Object obj) {
if (obj instanceof StringHash) {
StringHash other = ((StringHash) obj);
return this.hashcode == other.hashcode && this.length == other.length && this.upper == other.upper &&
this.lower == other.lower;
}
return false;
}
}
compareTo()
is only needed if you want to use the objects in a sorted array. When I run this code:
long beforeUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Set<String> hashSet = new HashSet<>();
for (int index = 0; index < 1000000; index++)
hashSet.add(UUID.randomUUID().toString());
long actualMemUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() - beforeUsedMem;
System.out.println("Items " + hashSet.size());
System.out.println("Mem used: " + actualMemUsed / (1024 * 1024) + " MB");
Here is the result:
Items 1000000
Mem used: 144 MB
With my object:
long beforeUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Set<StringHash> hashSet = new HashSet<>();
for (int index = 0; index < 1000000; index++)
hashSet.add(new StringHash(UUID.randomUUID().toString()));
long actualMemUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() - beforeUsedMem;
System.out.println("Items " + hashSet.size());
System.out.println("Mem used: " + actualMemUsed / (1024 * 1024) + " MB");
Here is the result:
Items 1000000
Mem used: 106 MB
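As for the sorted-array use of compareTo() mentioned above, the following is a rough sketch of how that could look: collect the fingerprints, sort them, and count the positions where neighbours differ. The names `SortedDistinctCount` and `Fingerprint` are hypothetical; `Fingerprint` is a condensed stand-in for the same four fields as StringHash. Note that this kind of count is not strictly exact: two different strings that happen to agree on all four fields would be counted once, although that should be very rare.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

final class SortedDistinctCount {
    // Hypothetical condensed stand-in for StringHash: keeps only the four fingerprint fields.
    static final class Fingerprint implements Comparable<Fingerprint> {
        final int length;
        final int hashcode;
        final long upper;
        final long lower;

        Fingerprint(String value) {
            long u = 0, l = 0;
            for (int i = 0; i < value.length(); i++) {
                char c = value.charAt(i);
                u = 255 * u + c;
                l = 127 * l + c;
            }
            this.length = value.length();
            this.hashcode = value.hashCode();
            this.upper = u;
            this.lower = l;
        }

        @Override
        public int compareTo(Fingerprint o) {
            if (hashcode != o.hashcode) return Integer.compare(hashcode, o.hashcode);
            if (length != o.length) return Integer.compare(length, o.length);
            if (upper != o.upper) return Long.compare(upper, o.upper);
            return Long.compare(lower, o.lower);
        }
    }

    // Consumes the values one at a time, so the original strings can be garbage collected.
    static int countDistinct(Iterator<String> values) {
        List<Fingerprint> keys = new ArrayList<>();
        while (values.hasNext()) {
            keys.add(new Fingerprint(values.next()));
        }
        Collections.sort(keys);
        int distinct = 0;
        for (int i = 0; i < keys.size(); i++) {
            // After sorting, equal fingerprints are adjacent, so count only the first of each run.
            if (i == 0 || keys.get(i).compareTo(keys.get(i - 1)) != 0) distinct++;
        }
        return distinct;
    }
}
```

Compared with a HashSet, a sorted list avoids the per-entry node objects of the hash table, at the cost of a sort at the end. How much memory that saves in practice would need to be measured.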
Upvotes: 0