PeterK

Reputation: 6317

How to approximate the count of distinct values in an array in a single pass through it

I have several huge arrays (millions++ members). All of them are arrays of numbers and they are not sorted (and I cannot sort them). Some are uint8_t, some uint16_t/32/64. I would like to approximate the count of distinct values in these arrays. The conditions are as follows:

  1. speed is VERY important; I need to do this in one pass through the array and I must go through it sequentially (cannot jump back and forth) (I am doing this in C++, if that's important)
  2. I don't need EXACT counts. What I want to know is, say for a uint32_t array, whether there are around 10 or 20 distinct numbers, or thousands, or millions.
  3. I have quite a bit of memory that I can use, but the less is used the better
  4. the smaller the array data type, the more accurate I need to be
  5. I don't mind the STL, but if I can do it without it that would be great (no BOOST though, sorry)
  6. if the approach can be easily parallelized, that would be cool (but it's not a mandatory condition)

Examples of perfect output:

ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values

I understand that some of the constraints may seem illogical, but that's the way it is. As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However, if someone has thoughts on that too, feel free to share them!

EDIT: a bit of clarification regarding the STL. If you have a solution using it, please post it. Not using the STL would just be a bonus for us; we don't fancy it too much. However, if it is a good solution, it will be used!

Upvotes: 14

Views: 1177

Answers (5)

Alfredo Castaneda Garcia

Reputation: 1659

I've just thought of an interesting solution. It's based on a law of Boolean algebra called the idempotence of multiplication, which states that:

X * X = X

From it, and using the commutative property of Boolean multiplication, we can deduce that:

X * Y * X = X * X * Y = X * Y

Now, do you see where I'm going with this? This is how the algorithm would work (I'm terrible with pseudo-code):

  1. make c = element1 & element2 (bitwise AND between the binary representations of the integers)

  2. for i = 3 until i == size_of_array: make b = c & element[i]; if b != c then different_values++; c = b

In the first iteration, we make (element1 * element2) * element3. We could represent it as:

(X * Y) * Z

If Z (element3) is equal to X (element1), then:

(X * Y) * Z = X * Y * X = X * Y

And if Z is equal to Y (element2), then:

(X * Y) * Z = X * Y * Y = X * Y

So, if Z isn't different from X or Y, then X * Y won't change when we multiply it by Z.

This remains valid for big expressions, like:

(X * A * Z * G * T * P * S) * S = X * A * Z * G * T * P * S

If we receive a value which is a factor of our big multiplicand (that means it has already been computed), then the big multiplicand won't change when we multiply it by the received input, so there's no new distinct value.

So that's how it goes: each time a distinct value is computed, the multiplication of our big multiplicand by that distinct value will be different from the big operand. So, with b = c & element[i], if b != c we just increment our distinct values counter.

I guess I'm not being clear enough. If that's the case, please let me know.
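Here is a literal C++ transcription of the algorithm (a sketch only: the pseudo-code does not say how to initialise the counter, so starting it at 2, assuming element1 != element2, is my own choice, as is the function name):

#include <cstddef>
#include <stdint.h>

// c accumulates the bitwise AND of everything seen so far; whenever
// AND-ing in a new element changes the accumulator, we count a new value.
size_t countDistinctByAnd(const uint32_t* data, size_t n) {
    if (n < 2) return n;
    uint32_t c = data[0] & data[1];   // step 1: c = element1 & element2
    size_t different_values = 2;      // assumption: element1 != element2
    for (size_t i = 2; i < n; ++i) {  // step 2
        uint32_t b = c & data[i];
        if (b != c) ++different_values;
        c = b;
    }
    return different_values;
}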

Upvotes: 0

Steve Jessop

Reputation: 279255

I'm pretty sure you can do it by:

  1. Create a Bloom filter
  2. Run through the array inserting each element into the filter (this is a "slow" O(n), since it requires computing several independent decent hashes of each value)
  3. Count how many bits are set in the Bloom filter
  4. Compute back from the density of the filter to an estimate of the number of distinct values. I don't know the calculation off the top of my head, but any treatment of the theory of Bloom filters goes into this, because it's vital to the probability of the filter giving a false positive on a lookup.
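If a sketch helps, here is a minimal C++ version of steps 1 to 4 (the double-hashing trick on top of std::hash and all parameter choices are my own illustrative assumptions, not part of the answer). The step-4 calculation is the standard fill-ratio estimate n ≈ -(m/k) * ln(1 - t/m), where m is the filter size in bits, k the number of hashes per value, and t the number of set bits:

#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <stdint.h>
#include <vector>

struct BloomEstimator {
    std::vector<bool> bits;   // m bits
    size_t k;                 // number of hash functions per value

    BloomEstimator(size_t m, size_t hashes) : bits(m, false), k(hashes) {}

    void insert(uint64_t value) {
        // Double hashing: derive k indices from two base hashes.
        size_t h1 = std::hash<uint64_t>()(value);
        size_t h2 = std::hash<uint64_t>()(value ^ 0x9E3779B97F4A7C15ull) | 1;
        for (size_t i = 0; i < k; ++i)
            bits[(h1 + i * h2) % bits.size()] = true;
    }

    // Step 4: estimate the distinct count from the bit density.
    double estimate() const {
        size_t t = 0;
        for (size_t i = 0; i < bits.size(); ++i) t += bits[i];
        if (t == bits.size())            // saturated: estimate diverges
            return std::numeric_limits<double>::infinity();
        double m = double(bits.size());
        return -(m / k) * std::log(1.0 - t / m);
    }
};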

Presumably if you're simultaneously computing the top 10 most frequent values, then if there are fewer than 10 distinct values you'll know exactly what they are and you don't need an estimate.

I believe the "most frequently used" problem is difficult (well, memory-consuming). Suppose for a moment that you only want the top 1 most frequently used value. Suppose further that you have 10 million entries in the array, and that after the first 9.9 million of them, none of the numbers you've seen so far has appeared more than 100k times. Then any of the values you've seen so far might be the most-frequently used value, since any of them could have a run of 100k values at the end. Even worse, any two of them could have a run of 50k each at the end, in which case the count from the first 9.9 million entries is the tie-breaker between them.

So in order to work out in a single pass which is the most frequently used, I think you need to know the exact count of each value that appears in the 9.9 million. You have to prepare for that freak case of a near-tie between two values in the last 0.1 million, because if it happens you aren't allowed to rewind and check the two relevant values again.

Eventually you can start culling values -- if there's a value with a count of 5000 and only 4000 entries left to check, then you can cull anything with a count of 1000 or less. But that doesn't help very much.

So I might have missed something, but I think that in the worst case, the "most frequently used" problem requires you to maintain a count for every value you have seen, right up until nearly the end of the array. So you might as well use that collection of counts to work out how many distinct values there are.

Upvotes: 7

Louis Ricci

Reputation: 21086

For 8- and 16-bit values it's pretty obvious: you can track every possibility on every iteration.

When you get to 32 and 64 bit integers, you don't really have the memory to track every possibility.

Here are a few natural suggestions that are likely outside the bounds of your constraints.

I don't really understand why you can't sort the array. Radix sort is O(n), and once sorted it would take one more pass to get accurate distinctness and top-X information. In reality it would be 6 passes altogether for 32-bit if you used a 1-byte radix (1 pass for counting + 4 passes, one per byte, + 1 pass for getting the values).
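For what it's worth, a rough sketch of that route, matching the pass count above (one histogram pass, four distribution passes, one counting pass; the function name is illustrative):

#include <cstddef>
#include <stdint.h>
#include <vector>

// LSD radix sort with a 1-byte radix, then one pass over the sorted
// data to count distinct values exactly.
size_t sortAndCountDistinct(std::vector<uint32_t>& a) {
    size_t hist[4][256] = {};
    for (size_t i = 0; i < a.size(); ++i)     // pass 1: all four histograms
        for (int b = 0; b < 4; ++b)
            ++hist[b][(a[i] >> (8 * b)) & 0xFF];

    std::vector<uint32_t> tmp(a.size());
    for (int b = 0; b < 4; ++b) {             // passes 2-5: stable distribution
        size_t offset[256];
        size_t sum = 0;
        for (int v = 0; v < 256; ++v) { offset[v] = sum; sum += hist[b][v]; }
        for (size_t i = 0; i < a.size(); ++i)
            tmp[offset[(a[i] >> (8 * b)) & 0xFF]++] = a[i];
        a.swap(tmp);                          // sorted result ends up in a
    }

    size_t distinct = a.empty() ? 0 : 1;      // pass 6: count runs
    for (size_t i = 1; i < a.size(); ++i)
        if (a[i] != a[i - 1]) ++distinct;
    return distinct;
}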

In the same vein as above, why not just use SQL? You could create a stored procedure that takes the array in as a table-valued parameter and returns the number of distinct values and the top X values in one go. This stored procedure could also be called in parallel.

-- number of distinct
SELECT COUNT(DISTINCT(n)) FROM @tmp
-- top x
SELECT TOP 10 n, COUNT(n) FROM @tmp GROUP BY n ORDER BY COUNT(n) DESC

Upvotes: 0

Matthieu M.

Reputation: 299890

One approach that can work, even for big values, is to spread them into lazily allocated buckets.

Suppose that you are working with 32-bit integers: creating an array of 2**32 bits is relatively impractical (2**29 bytes, hum). However, we can probably assume that 2**16 pointers is still reasonable (2**19 bytes: 512 kB), so we create 2**16 buckets (null pointers).

The big idea, therefore, is to take a "sparse" approach to counting, and hope that the integers won't be too dispersed, and thus that many of the bucket pointers will remain null.

#include <algorithm>
#include <cstddef>
#include <stdint.h>
#include <utility>
#include <vector>

typedef std::pair<int32_t, int32_t> Pair;
typedef std::vector<Pair> Bucket;
typedef std::vector<Bucket*> Vector;   // create with 2**16 null entries

struct Comparator {
  bool operator()(Pair const& left, Pair const& right) const {
    return left.first < right.first;
  }
};

// Keeps the bucket sorted by value; inserts a (value, 1) pair the first
// time a value is seen, otherwise bumps its count.
void add(Bucket& v, int32_t value) {
  Pair const pair(value, 1);
  Bucket::iterator it = std::lower_bound(v.begin(), v.end(), pair, Comparator());
  if (it == v.end() or it->first > value) {
    v.insert(it, pair);
    return;
  }

  it->second += 1;
}

// Dispatches each value to its bucket (keyed on the top 16 bits),
// allocating buckets lazily; v must hold 2**16 null pointers initially.
void gather(Vector& v, int32_t const* begin, int32_t const* end) {
  for (; begin != end; ++begin) {
    uint16_t const index = static_cast<uint32_t>(*begin) >> 16;

    Bucket*& bucket = v[index];

    if (bucket == 0) { bucket = new Bucket(); }

    add(*bucket, *begin);
  }
}

Once you have gathered your data, you can count the number of different values or find the top or bottom pretty easily.
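For instance, a small helper for the distinct count (the name is mine, not from the answer); since each Pair in a bucket represents one distinct value, we just sum the bucket sizes:

size_t countDistinct(Vector const& v) {
    size_t total = 0;
    for (Vector::const_iterator it = v.begin(); it != v.end(); ++it)
        if (*it != 0) total += (*it)->size();  // one Pair per distinct value
    return total;
}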

A few notes:

  • the number of buckets is completely customizable (thus letting you control the amount of original memory)
  • the partitioning strategy is customizable as well (what I have made here is just a cheap hash table)
  • it is possible to monitor the number of allocated buckets and abandon, or switch gear, if it starts blowing up.
  • if each value is different, then this just won't work; but by the time you realize it, you will already have collected many counts, so you'll at least be able to give a lower bound on the number of different values, and you'll also have a starting point for the top/bottom.

If you manage to gather those statistics, then the hard part of the work is done.

Upvotes: 2

TonyK

Reputation: 17114

For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.

For larger values, if you are not interested in counts above 100000, std::map is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
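A minimal sketch of both suggestions, with illustrative names and signatures (the 8-bit case is the same as the 16-bit one with a 256-entry table; the map version gives up once it passes the 100000 threshold mentioned above):

#include <cstddef>
#include <map>
#include <stdint.h>
#include <vector>

// Exact, one pass: a 2**16-entry table (256 kB with 32-bit counters).
// Every time a counter goes from zero to one, a new value was found.
size_t countDistinct16(const uint16_t* data, size_t n) {
    std::vector<uint32_t> counts(1 << 16, 0);
    size_t distinct = 0;
    for (size_t i = 0; i < n; ++i)
        if (counts[data[i]]++ == 0) ++distinct;
    return distinct;
}

// For wider types: exact counts in a std::map, giving up once more
// distinct values than we care about have been seen.
size_t countDistinct32(const uint32_t* data, size_t n, size_t cap) {
    std::map<uint32_t, uint32_t> counts;
    for (size_t i = 0; i < n; ++i) {
        ++counts[data[i]];
        if (counts.size() > cap) return cap + 1;  // report as "cap+"
    }
    return counts.size();
}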

Upvotes: 7
