Reputation: 150138
I am receiving a stream of unordered Int32 values and need to track the count of distinct values that I receive.
My thought is to add the Int32 values into a HashSet<Int32>. Duplicate entries will simply not be added per the behavior of HashSet.
Do I understand correctly that set membership is based on GetHashCode() and that the hash code of an Int32 is the number itself?
Is there an approach that is either more CPU or more memory efficient?
UPDATE
The data stream is rather large. Simply using LINQ to iterate the stream to get the distinct count is not what I'm after, since that would involve iterating the stream a second time.
Upvotes: 3
Views: 188
Reputation: 10575
I don't really know your domain, but there are algorithms that estimate the cardinality of large sets using very little memory and processing.
I'm using HyperLogLog in a project of mine. I use it to count several million distinct items using as little as 8 KB of memory with about 1% error.
Here is a paper describing it:
http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
I've implemented it in Java and Python. The Python version is open source and the algorithm is rather small. Check it out:
https://github.com/juanplopes/hyperloglog.py/blob/master/hyperloglog.py
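For readers who want to see the shape of the algorithm in C#, here is a minimal sketch of HyperLogLog for Int32 values. It is not a port of the linked Python code. The class name, the default of 13 index bits (8192 one-byte registers, so about 8 KB) and the splitmix64-style mixing function are assumptions chosen for illustration; the relative standard error is roughly 1.04 / sqrt(m) for m registers.

```csharp
using System;

// Minimal HyperLogLog sketch; the names and parameters here are illustrative assumptions.
public sealed class HyperLogLogSketch
{
    private readonly int _p;            // number of hash bits used as the register index
    private readonly byte[] _registers; // m = 2^p registers, one byte each

    public HyperLogLogSketch(int p = 13)
    {
        if (p < 7 || p > 16) throw new ArgumentOutOfRangeException(nameof(p));
        _p = p;
        _registers = new byte[1 << p];
    }

    public void Add(int value)
    {
        // Int32.GetHashCode is the value itself, so mix the bits first.
        ulong h = Mix(unchecked((ulong)(long)value));
        int index = (int)(h >> (64 - _p));   // top p bits pick the register
        ulong rest = h << _p;                // remaining bits determine the rank
        byte rank = (byte)Rank(rest, 64 - _p);
        if (rank > _registers[index]) _registers[index] = rank;
    }

    public double Estimate()
    {
        int m = _registers.Length;
        double sum = 0;
        int zeros = 0;
        foreach (byte r in _registers)
        {
            sum += Math.Pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant for m >= 128
        double estimate = alpha * m * m / sum;
        // Small-range correction: fall back to linear counting.
        if (estimate <= 2.5 * m && zeros > 0)
            estimate = m * Math.Log((double)m / zeros);
        return estimate;
    }

    // Position of the first 1 bit from the top, or maxBits + 1 if there is none.
    private static int Rank(ulong w, int maxBits)
    {
        int rank = 1;
        while (rank <= maxBits && (w & 0x8000000000000000UL) == 0)
        {
            rank++;
            w <<= 1;
        }
        return rank;
    }

    // splitmix64-style finalizer, used here as a stand-in 64-bit hash.
    private static ulong Mix(ulong z)
    {
        unchecked
        {
            z += 0x9E3779B97F4A7C15UL;
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBUL;
            return z ^ (z >> 31);
        }
    }
}
```

Call Add for every value as it arrives and read Estimate whenever a count is needed. Memory stays fixed at the register array no matter how many values pass through, at the cost of an approximate answer.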
Upvotes: 1
Reputation: 27429
One thought, if you have a very large stream of data (millions to billions of values), is to use a Bloom filter. It lets you maintain an approximate count as you stream the data, and if you need an exact count, you can compute it offline.
A reasonable C# implementation is here: http://bloomfilter.codeplex.com/
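To make the idea concrete, here is a minimal sketch that does not use the linked library. It counts a value as new whenever the filter reports it as not yet seen. The class name, the double-hashing scheme and the mixing function are assumptions made for illustration. Because a Bloom filter can produce false positives, the count can only undershoot the true number of distinct values, never overshoot it.

```csharp
using System;
using System.Collections;

// Hypothetical Bloom-filter-based distinct counter (approximate, never over-counts).
public sealed class BloomDistinctCounter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public long ApproximateCount { get; private set; }

    public BloomDistinctCounter(int bitCount, int hashCount)
    {
        if (bitCount <= 0) throw new ArgumentOutOfRangeException(nameof(bitCount));
        if (hashCount <= 0) throw new ArgumentOutOfRangeException(nameof(hashCount));
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Observe(int value)
    {
        ulong h = Mix(unchecked((ulong)(long)value));
        uint h1 = (uint)h;
        uint h2 = (uint)(h >> 32) | 1u; // odd step for double hashing
        bool isNew = false;
        for (uint i = 0; i < (uint)_hashCount; i++)
        {
            int index = (int)(unchecked(h1 + i * h2) % (uint)_bits.Length);
            if (!_bits[index])
            {
                // At least one unset bit proves the value was never observed before.
                isNew = true;
                _bits[index] = true;
            }
        }
        if (isNew) ApproximateCount++;
    }

    // splitmix64-style finalizer, used here as a stand-in 64-bit hash.
    private static ulong Mix(ulong z)
    {
        unchecked
        {
            z += 0x9E3779B97F4A7C15UL;
            z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9UL;
            z = (z ^ (z >> 27)) * 0x94D049BB133111EBUL;
            return z ^ (z >> 31);
        }
    }
}
```

The bit count and number of hash functions control the false-positive rate, so they should be sized for the expected number of distinct values.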
Upvotes: 1
Reputation: 150138
I appreciate the other answers, but find that the original approach of using a HashSet<T> is most appropriate for my situation.
It is not efficient to re-iterate the stream to get the distinct count.
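A minimal sketch of that single-pass approach, assuming values are handed over one at a time; the DistinctCounter class and its member names are hypothetical:

```csharp
using System.Collections.Generic;

// Hypothetical helper: tracks the number of distinct values in a single pass.
public sealed class DistinctCounter
{
    private readonly HashSet<int> _seen = new HashSet<int>();

    // HashSet<T>.Add returns false when the value is already present,
    // so duplicates are ignored without a separate Contains check.
    public bool Observe(int value) => _seen.Add(value);

    // The running distinct count, available without re-reading the stream.
    public int Count => _seen.Count;
}
```

Calling Observe for each incoming value keeps Count current, so the stream never has to be iterated a second time.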
Upvotes: 0
Reputation: 838656
Assuming you have some sort of IEnumerable<int> you can do the following:
int count = stream.Distinct().Count();
Do I understand correctly that set membership is based on GetHashCode()
Not quite. Membership in a HashSet is based on a combination of GetHashCode and an equality check. In general, two objects can have the same hash code but not be equal, though for int that cannot happen.
and that the hash code of an Int32 is the number itself?
Yes, that is correct.
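Both points can be checked with a small experiment. The sketch below uses a hypothetical CollidingKey type whose instances all share one hash code, plus a helper that compares an int with its own hash code. That Int32.GetHashCode returns the value itself is how current .NET implementations behave, not something to rely on as a contract.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type whose hash code collides for every instance.
public sealed class CollidingKey : IEquatable<CollidingKey>
{
    public int Id { get; }
    public CollidingKey(int id) { Id = id; }

    public bool Equals(CollidingKey other) => other != null && other.Id == Id;
    public override bool Equals(object obj) => Equals(obj as CollidingKey);
    public override int GetHashCode() => 1; // deliberately poor: every key lands in one bucket
}

public static class MembershipDemo
{
    // Ids 1 and 2 share a hash code but are not equal, so both are kept;
    // the second key with Id 1 is rejected by the equality check.
    public static int CountDistinctKeys()
    {
        var set = new HashSet<CollidingKey>
        {
            new CollidingKey(1),
            new CollidingKey(2),
            new CollidingKey(1)
        };
        return set.Count;
    }

    // True when the hash code of an int is the int itself.
    public static bool IntHashIsIdentity(int value) => value.GetHashCode() == value;
}
```

The hash code only narrows the search to a bucket; the equality check decides membership.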
Is there an approach that is either more CPU or more memory efficient?
If you know that your ints will be in a small range, you can efficiently store which you have seen by using a bitmap. For example, if you have a range of 1,000,000 you can store which ints you have seen in 1,000,000 bits. A bit set to 1 at index n means that you have seen the integer n. Here's some example code showing one way to implement this:
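// Needs System, System.Collections.Generic, System.Diagnostics and System.Linq.
void Main()
{
    int max = 1000000;
    IEnumerable<int> stream = GetStream(max);
    int count = DistinctCount(stream, max);
    int count2 = stream.Distinct().Count();
    Debug.Assert(count == count2);
}

// Counts distinct values in [0, max) using one bit per possible value.
int DistinctCount(IEnumerable<int> stream, int max)
{
    // Round up so the array is large enough when max is not a multiple of 32.
    uint[] seen = new uint[(max + 31) / 32];
    foreach (int x in stream)
    {
        // Set bit (x % 32) of word (x / 32) to record that x was seen.
        seen[x / 32] |= 1u << (x % 32);
    }
    int count = 0;
    foreach (uint s in seen)
    {
        // Count the bits set in this word.
        uint t = s;
        while (t > 0)
        {
            if (t % 2 == 1) { count++; }
            t /= 2;
        }
    }
    return count;
}

// Builds a test stream of 2,000,000 random values in [0, max).
IEnumerable<int> GetStream(int max)
{
    List<int> stream = new List<int>();
    Random random = new Random();
    for (int i = 0; i < 2000000; ++i)
    {
        stream.Add(random.Next(max));
    }
    return stream;
}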
Upvotes: 4
Reputation: 2848
I assume that you receive the values in chunks, anywhere from one int at a time to a bunch of ints.
Given that, the simplest thing is probably the best; I'd use a hash too. However, I don't see how you can use a HashSet. If you want the count of distinct values, you'd only get the found values:
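// Maps each value to the number of times it has been seen.
// _countHash.Count is the number of distinct values.
Dictionary<int,int> _countHash = new Dictionary<int,int>();

void moreIntsArrived(IEnumerable<int> bunch)
{
    foreach(var value in bunch)
    {
        if (_countHash.ContainsKey(value))
        {
            // Seen before: increment the occurrence count.
            _countHash[value] += 1;
        }
        else
        {
            // First occurrence.
            _countHash[value] = 1;
        }
    }
}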
However, do what Mr. Hanselman suggests: measure it.
There is probably a trade-off between doing the ContainsKey check and just taking the hit of the exception when the key is not found, if your stream is large enough that it eventually stops producing new unique values:
void moreIntsArrived(IEnumerable<int> bunch)
{
    foreach(var value in bunch)
    {
        try
        {
            // Throws KeyNotFoundException the first time a value is seen.
            int c = _countHash[value];
            _countHash[value] = c + 1;
        }
        catch(KeyNotFoundException)
        {
            // First occurrence.
            _countHash[value] = 1;
        }
    }
}
Then again, there is the Dictionary.TryGetValue() method, but it depends on what that does inside :-) Use the source.
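A minimal sketch of the TryGetValue variant, for comparison when measuring. The OccurrenceCounter class and its member names are hypothetical. TryGetValue sets its out parameter to the default value (0 for int) when the key is missing, so no branch or exception is needed.

```csharp
using System.Collections.Generic;

// Hypothetical wrapper around the dictionary-based counting approach.
public sealed class OccurrenceCounter
{
    private readonly Dictionary<int, int> _counts = new Dictionary<int, int>();

    public void MoreIntsArrived(IEnumerable<int> bunch)
    {
        foreach (int value in bunch)
        {
            // 'seen' is 0 when the key is absent, so the first occurrence stores 1.
            _counts.TryGetValue(value, out int seen);
            _counts[value] = seen + 1;
        }
    }

    // Number of distinct values seen so far.
    public int DistinctCount => _counts.Count;

    // Number of times a given value has been seen (0 if never).
    public int OccurrencesOf(int value) => _counts.TryGetValue(value, out int n) ? n : 0;
}
```

This does one lookup to read and one to write per value, and avoids the cost of throwing an exception for every new key.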
Upvotes: 0