Reputation: 3409
I need to regularly step through a large collection of objects and maintain the set of unique values of a particular String property on them.
I'm using a HashSet to hold the unique values, but was wondering whether it's more efficient to check if a value already exists in the HashSet before adding it, or to just attempt to add every value?
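To illustrate, these are the two patterns I'm weighing; MyObject and its Source property are just placeholders for my actual types:

    HashSet<String> uniqueValues = new HashSet<String>();

    // Option 1: check before adding
    foreach (MyObject obj in bigCollection)
    {
        if (!uniqueValues.Contains(obj.Source))
            uniqueValues.Add(obj.Source);
    }

    // Option 2: just add everything; the set ignores duplicates
    foreach (MyObject obj in bigCollection)
    {
        uniqueValues.Add(obj.Source);
    }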
Upvotes: 0
Views: 1862
Reputation: 3409
As my original answer was generally derided, I've had another go.
    Int32 maxUniques = 1;
    Int32 collectionSize = 100000000;
    Random rand = new Random();
    while (maxUniques <= collectionSize)
    {
        // Build a collection of collectionSize random values drawn
        // from a pool of maxUniques possible values
        List<Int32> bigCollection = new List<Int32>();
        bigCollection.Capacity = collectionSize;
        for (Int32 count = 0; count < collectionSize; count++)
            bigCollection.Add(rand.Next(maxUniques));

        // Time the Contains-then-Add approach
        HashSet<Int32> uniqueSources = new HashSet<Int32>();
        Stopwatch watch = new Stopwatch();
        watch.Start();
        foreach (Int32 num in bigCollection)
        {
            if (!uniqueSources.Contains(num))
                uniqueSources.Add(num);
        }
        Console.WriteLine(String.Format("With {0,10:N0} unique values in a set of {1,10:N0} values, the time taken for conditional add: {2,6:N0} ms", uniqueSources.Count, collectionSize, watch.ElapsedMilliseconds));

        // Time the unconditional Add approach on a fresh set
        uniqueSources = new HashSet<Int32>();
        watch.Restart();
        foreach (Int32 num in bigCollection)
        {
            uniqueSources.Add(num);
        }
        Console.WriteLine(String.Format("With {0,10:N0} unique values in a set of {1,10:N0} values, the time taken for simple add: {2,6:N0} ms", uniqueSources.Count, collectionSize, watch.ElapsedMilliseconds));
        Console.WriteLine();

        maxUniques *= 10;
    }
Which gave the following output:
With 1 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,004 ms
With 1 unique values in a set of 100,000,000 values, the time taken for simple add: 2,540 ms

With 10 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,066 ms
With 10 unique values in a set of 100,000,000 values, the time taken for simple add: 2,391 ms

With 100 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,057 ms
With 100 unique values in a set of 100,000,000 values, the time taken for simple add: 2,410 ms

With 1,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,011 ms
With 1,000 unique values in a set of 100,000,000 values, the time taken for simple add: 2,459 ms

With 10,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,219 ms
With 10,000 unique values in a set of 100,000,000 values, the time taken for simple add: 2,414 ms

With 100,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 3,024 ms
With 100,000 unique values in a set of 100,000,000 values, the time taken for simple add: 3,124 ms

With 1,000,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 8,937 ms
With 1,000,000 unique values in a set of 100,000,000 values, the time taken for simple add: 9,310 ms

With 9,999,536 unique values in a set of 100,000,000 values, the time taken for conditional add: 11,798 ms
With 9,999,536 unique values in a set of 100,000,000 values, the time taken for simple add: 11,660 ms

With 63,199,938 unique values in a set of 100,000,000 values, the time taken for conditional add: 20,847 ms
With 63,199,938 unique values in a set of 100,000,000 values, the time taken for simple add: 20,213 ms
Which is curious to me.
Up to 1% unique values, it is faster to call the Contains() method first rather than just keep hitting Add(). At 10% and 63%, it was faster to just Add().
To put it another way:
100 million Contains() is faster than 99 million Add()
100 million Contains() is slower than 90 million Add()
I adjusted the code to try 1 million to 10 million unique values in 1 million increments, and found the inflection point is somewhere around 7-10%; the results weren't conclusive.
So if you're expecting less than 7% of values to be added, it's faster to call Contains() first. More than 7%, just call Add().
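As a back-of-envelope model (not from the measurements, just arithmetic): with N values and u uniques, the conditional version does N Contains calls plus u Add calls, while the simple version does N Add calls. Cancelling the u Adds common to both, the comparison is N Contains against N - u Adds. If c is the average cost of a Contains and a the average cost of a duplicate-rejecting Add, Contains-first wins while c*N < a*(N - u), i.e. while u/N < 1 - c/a. A break-even around 7-10% would mean a rejected Add costs roughly 7-11% more than a bare Contains.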
Upvotes: 1
Reputation: 127603
Your test is flawed for the reasons Jon Hanna stated, so it did not give you accurate results. When you call Add, HashSet internally calls AddIfNotPresent, and the first thing AddIfNotPresent does is check whether the object already exists (code obtained from ILSpy):
    public bool Add(T item)
    {
        return this.AddIfNotPresent(item);
    }

    private bool AddIfNotPresent(T value)
    {
        if (this.m_buckets == null)
        {
            this.Initialize(0);
        }
        int num = this.InternalGetHashCode(value);
        int num2 = num % this.m_buckets.Length;
        int num3 = 0;
        for (int i = this.m_buckets[num % this.m_buckets.Length] - 1; i >= 0; i = this.m_slots[i].next)
        {
            if (this.m_slots[i].hashCode == num && this.m_comparer.Equals(this.m_slots[i].value, value))
            {
                return false;
            }
            num3++;
        }
        //(Snip)
So by calling Contains and then Add, you check whether the object exists twice. If the bucket it hashes to contains many items, that extra check can add up to a significant performance loss.
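Note that Add already reports the result of that internal check: it returns true if the item was inserted and false if it was already present, so you never need the separate Contains call. A minimal sketch of the single-lookup pattern (uniqueSources and values stand for whatever set and input you're using):

    HashSet<String> uniqueSources = new HashSet<String>();
    foreach (String value in values)
    {
        // Add returns false for a duplicate, so the existence
        // check and the insert share a single hash lookup
        if (uniqueSources.Add(value))
        {
            // value was seen for the first time
        }
    }

This gives you the behavior of Contains-then-Add with only one lookup per item.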
Upvotes: 2
Reputation: 3409
As I was typing the question, it occurred to me that someone would ask why I didn't just test it myself. So I tested it myself.
I created a collection with 1.26 million records and 21 unique source strings.
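The LoggingMessage type is from my own code; all the test cares about is its String Source property, so a minimal stand-in would be something like:

    public class LoggingMessage
    {
        // the only member the test touches
        public String Source { get; set; }
    }

I then ran the collection through the following code: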
    HashSet<String> uniqueSources = new HashSet<String>();
    Stopwatch watch = new Stopwatch();
    watch.Start();
    foreach (LoggingMessage mess in bigCollection)
    {
        uniqueSources.Add(mess.Source);
    }
    Console.WriteLine(String.Format("Time taken for simple add: {0}ms", watch.ElapsedMilliseconds));

    uniqueSources.Clear();
    watch.Restart();
    foreach (LoggingMessage mess in bigCollection)
    {
        if (!uniqueSources.Contains(mess.Source))
            uniqueSources.Add(mess.Source);
    }
    Console.WriteLine(String.Format("Time taken for conditional add: {0}ms", watch.ElapsedMilliseconds));
With the following results:
Time taken for simple add: 147ms
Time taken for conditional add: 125ms
So for my data at least, checking for existence first doesn't slow things down; it is actually slightly faster. The difference is pretty small either way, though.
Upvotes: -1