Ashwani K

Reputation: 8000

Count unique strings in less memory

I am trying to count unique values in a process where the values are fetched from a remote source.

There can be millions of values.

I know I can use a HashSet to get the unique count, but it takes too much memory.

Sample code:

long beforeUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Set<String> hashSet = new HashSet<>();
for (int index = 0; index < 1000000; index++) {
    hashSet.add(UUID.randomUUID().toString());
}
long afterUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
// Memory consumed by the set is "after" minus "before"
long actualMemUsed = afterUsedMem - beforeUsedMem;
System.out.println("Items " + hashSet.size());
System.out.println("Mem used: " + actualMemUsed / (1024 * 1024) + " MB");

For 1 million unique strings, the hashset takes around 240MB of RAM.

I can't use a DB to store these values, so querying a database for the unique count is out of the question.

Is there any other way of getting the count of unique values?

Upvotes: 0

Views: 490

Answers (2)

n1t4chi

Reputation: 492

If you are really interested in memory savings and can accept a small error, check out algorithms for the count-distinct problem.

The best-known example is HyperLogLog, which needs only a few kilobytes of memory to count millions of distinct values with a low error margin (from Wikipedia: about 1.5 kB of memory for a typical error of 2% on more than 10^9 values).
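
To make the idea concrete, here is a minimal sketch of a HyperLogLog-style counter. It is not taken from any library: the class name `HyperLogLogSketch`, the choice of 2^14 one-byte registers (about 16 KB), and the FNV-1a based 64-bit hash are all assumptions made for illustration. In production you would more likely use an existing, tested implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Illustrative HyperLogLog-style estimator (hypothetical class, not a library API).
final class HyperLogLogSketch {
    private final int p;             // number of bits used to pick a register
    private final int m;             // number of registers, 2^p
    private final byte[] registers;  // one byte per register

    HyperLogLogSketch(int p) {
        if (p < 7 || p > 16) throw new IllegalArgumentException("p must be in [7, 16]");
        this.p = p;
        this.m = 1 << p;
        this.registers = new byte[m];
    }

    // Assumed hash: 64-bit FNV-1a followed by a bit-mixing finalizer.
    static long hash64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    void add(String value) {
        long h = hash64(value);
        int idx = (int) (h >>> (64 - p));  // top p bits select the register
        long rest = h << p;                // remaining bits, left-aligned
        // Rank = position of the first 1-bit in the remaining bits (1-based).
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    long estimate() {
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1 + 1.079 / m);  // bias constant for m >= 128
        double raw = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        if (raw <= 2.5 * m && zeros > 0) {
            return Math.round(m * Math.log((double) m / zeros));
        }
        return Math.round(raw);
    }

    public static void main(String[] args) {
        HyperLogLogSketch hll = new HyperLogLogSketch(14);
        for (int i = 0; i < 1_000_000; i++) {
            hll.add(UUID.randomUUID().toString());
        }
        System.out.println("Estimated unique items: " + hll.estimate());
    }
}
```

The memory used is fixed by the register count and does not grow with the number of values. The expected relative error is roughly 1.04 / sqrt(m), so more registers buy more accuracy.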

Upvotes: 2

Wireless4024

Reputation: 71

Here is my solution:
I created a compact hash object that stands in for the string, so the set stores a few fixed-size hash values instead of the full text.

public class StringHash implements Comparable<StringHash> {
    private final int length;
    private final int hashcode;
    private final long upper;
    private final long lower;

    public StringHash(String value) {
        this.length = value.length();
        long upperTemp = 0, lowerTemp = 0;
        for (int i = 0; i < length; ++i) {
            char c = value.charAt(i);
            upperTemp = 255 * upperTemp + c;
            lowerTemp = 127 * lowerTemp + c;
        }
        this.upper = upperTemp;
        this.lower = lowerTemp;
        this.hashcode = value.hashCode();
    }

    @Override
    public int hashCode() {
        return hashcode;
    }

    @Override
    public int compareTo(StringHash o) {
        if (hashcode != o.hashcode) return Integer.compare(hashcode, o.hashcode);
        if (length != o.length) return Integer.compare(length, o.length);
        if (upper != o.upper) return Long.compare(upper, o.upper);
        if (lower != o.lower) return Long.compare(lower, o.lower);
        return 0;
    }

    @Override
    public boolean equals(Object obj) {
        if (obj instanceof StringHash) {
            StringHash other = ((StringHash) obj);
            return this.hashcode == other.hashcode && this.length == other.length && this.upper == other.upper &&
                   this.lower == other.lower;
        }
        return false;
    }
}
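  • This object uses about 24 bytes of fields plus a 16-byte header, regardless of the string's length
  • I implemented compareTo() in case you want to keep the hashes in a sorted array
  • The same String always produces the same result (distinct strings could still collide, since only hashes are stored)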

When I run this code with plain strings:

long beforeUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Set<String> hashSet = new HashSet<>();
for (int index = 0; index < 1000000; index++)
    hashSet.add(UUID.randomUUID().toString());

long actualMemUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() - beforeUsedMem;
System.out.println("Items " + hashSet.size());
System.out.println("Mem used: " + actualMemUsed / (1024 * 1024) + " MB");

Here is the result:

Items 1000000
Mem used: 144 MB

With my object, the same test becomes:

long beforeUsedMem = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory();
Set<StringHash> hashSet = new HashSet<>();
for (int index = 0; index < 1000000; index++)
    hashSet.add(new StringHash(UUID.randomUUID().toString()));

long actualMemUsed = Runtime.getRuntime().totalMemory() - Runtime.getRuntime().freeMemory() - beforeUsedMem;
System.out.println("Items " + hashSet.size());
System.out.println("Mem used: " + actualMemUsed / (1024 * 1024) + " MB");

Here is the result:

Items 1000000
Mem used: 106 MB

Upvotes: 0
