Reputation: 6317
I have several huge arrays (millions++ members). All those are arrays of numbers and they are not sorted (and I cannot do that). Some are uint8_t
, some uint16_t/32/64
. I would like to approximate the count of distinct values in these arrays. The conditions are following:
Examples of perfect output:
ArrayA [uint32_t, 3M members]: ~128 distinct values
ArrayB [uint32_t, 9M members]: 100000+ distinct values
ArrayC [uint8_t, 50K members]: 2-5 distinct values
ArrayD [uint8_t, 700K members]: 64+ distinct values
I understand that some of the constraints may seem illogical, but thats the way it is. As a side note, I also want the top X (3 or 10) most used and least used values, but that is far easier to do and I can do it on my own. However if someone has thoughts for that too, feel free to share them!
EDIT: a bit of clarification regarding STL. If you have a solution using it, please post it. Not using STL would be just a bonus for us, we dont fancy it too much. However if it is a good solution, it will be used!
Upvotes: 14
Views: 1177
Reputation: 1659
I've just thought of an interesting solution. It's based on law of boolean algebra called Idempotence of Multiplication, which states that:
X * X = X
From it, and using the commutative property of boolean multiplication, we can deduce that:
X * Y * X = X * X * Y = X * Y
Now, you see where I'm going to? This is how the algorithm would work (I'm terrible with pseudo-code):
make c = element1 & element2 (binary AND between the binary representation of the integers)
for i=3 until i == size_of_array make b = c & element[i]; if b != c then diferent_values++; c=b;
In first iteration, we make (element1*element2) * element3. We could represent it as:
(X * Y) * Z
If Z (element3) is equal to X (element1), then:
(X * Y) * Z = X * Y * X = X * Y
And if Z is equal to Y (element2), then:
(X * Y) * Z = X * Y * Y = X * Y
So, if Z isn´t different to X or Y, then X * Y won't change when we multiply it for Z
This remains valid for big expressions, like:
(X * A * Z * G * T * P * S) * S = X * A * Z * G * T * P * S
If we receive a value which is factor of our big multiplicand (that means that it has been already computed) then the big multiplicand won't change when we multiply it to the recieved input, so there's no new distinct value.
So that's how it will go. Each time that a different value is computed then the multiplication of our big multiplicand and that distinct value, will be different to the big operand. So, with b = c & element[i]
, if b!= c
we just increment out distinct values counter.
I guess I'm no being clear enough. If that's the case, please let me know.
Upvotes: 0
Reputation: 279255
I'm pretty sure you can do it by:
Presumably if you're simultaneously computing the top 10 most frequent values, then if there are less than 10 distinct values you'll know exactly what they are and you don't need an estimate.
I believe the "most frequently used" problem is difficult (well, memory-consuming). Suppose for a moment that you only want the top 1 most frequently used value. Suppose further that you have 10 million entries in the array, and that after the first 9.9 million of them, none of the numbers you've seen so far has appeared more than 100k times. Then any of the values you've seen so far might be the most-frequently used value, since any of them could have a run of 100k values at the end. Even worse, any two of them could have a run of 50k each at the end, in which case the count from the first 9.9 million entries is the tie-breaker between them. So in order to work out in a single pass which is the most frequently used, I think you need to know the exact count of each value that appears in the 9.9 million. You have to prepare for that freak case of a near-tie between two values in the last 0.1 million, because if it happens you aren't allowed to rewind and check the two relevant values again. Eventually you can start culling values -- if there's a value with a count of 5000 and only 4000 entries left to check, then you can cull anything with a count of 1000 or less. But that doesn't help very much.
So I might have missed something, but I think that in the worst case, the "most frequently used" problem requires you to maintain a count for every value you have seen, right up until nearly the end of the array. So you might as well use that collection of counts to work out how many distinct values there are.
Upvotes: 7
Reputation: 21086
For 8 and 16 bit it's pretty obvious, you can track every possibility every iteration.
When you get to 32 and 64 bit integers, you don't really have the memory to track every possibility.
Here's a few natural suggestions that are likely outside the bounds of your constraints.
I don't really understand why you can't sort the array. RadixSort is O(n) and once sorted it would be one more pass to get accurate distinctiveness and top X information. In reality it would be 6 passes all together for 32bit if you used a 1 byte radix (1 pass for counting + 1 * 4 passes for each byte + 1 pass for getting values).
In the same cusp as above, why not just use SQL. You could create a stored procedure that takes the array in as a table valued parameter and return the number of distinct values and the top x values in one go. This stored procedure could also be called in parallel.
-- number of distinct
SELECT COUNT(DISTINCT(n)) FROM @tmp
-- top x
SELECT TOP 10 n, COUNT(n) FROM @tmp GROUP BY n ORDER BY COUNT(n) DESC
Upvotes: 0
Reputation: 299890
One approach that can work, even for big values, is to spread them into lazily allocated buckets.
Suppose that you are working with 32
bits integers, creating an array of 2**32
bits is relatively impractical (2**29
bytes, hum). However, we can probably assume that 2**16
pointers is still reasonable (2**19
bytes: 500kB), so we create 2**16
buckets (null pointers).
The big idea therefore is to take a "sparse" approach to counting, and hope that the integers won't be to dispersed, and thus that many of the buckets pointers will remain null
.
typedef std::pair<int32_t, int32_t> Pair;
typedef std::vector<Pair> Bucket;
typedef std::vector<Bucket*> Vector;
struct Comparator {
bool operator()(Pair const& left, Pair const& right) const {
return left.first < right.first;
}
};
void add(Bucket& v, int32_t value) {
Pair const pair(value, 1);
Vector::iterator it = std::lower_bound(v.begin(), v.end(), pair, Compare());
if (it == v.end() or it->first > value) {
v.insert(it, pair);
return;
}
it->second += 1;
}
void gather(Vector& v, int32_t const* begin, int32_t const* end) {
for (; begin != end; ++begin) {
uint16_t const index = *begin >> 16;
Bucket*& bucket = v[index];
if (bucket == 0) { bucket = new Bucket(); }
add(*bucket, *begin);
}
}
Once you have gathered your data, then you can count the number of different values or find the top or bottom pretty easily.
A few notes:
If you manage to gather those statistics, then the work is cut out for you.
Upvotes: 2
Reputation: 17114
For 8- and 16-bit values, you can just make a table of the count of each value; every time you write to a table entry that was previously zero, that's a different value found.
For larger values, if you are not interested in counts above 100000, std::map
is suitable, if it's fast enough. If that's too slow for you, you could program your own B-tree.
Upvotes: 7