TomDestry

Reputation: 3409

Fastest way to populate a HashSet

I need to regularly step through a large collection of objects and maintain the unique values of a particular String property within.

I'm using a HashSet to hold the unique values, but was wondering: is it more efficient to check whether a value already exists in the HashSet before adding it, or to just attempt to add every value?
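
To make the two options concrete, here is a minimal sketch of both patterns (uniqueSources stands in for my HashSet, value for the String property of the current object):

// Option 1: conditional add - check first, then add
if (!uniqueSources.Contains(value))
    uniqueSources.Add(value);

// Option 2: simple add - attempt the add unconditionally
uniqueSources.Add(value);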

Upvotes: 0

Views: 1862

Answers (3)

TomDestry

Reputation: 3409

As my original answer was generally derided, I've had another go.

// Sweep the number of distinct values from 1 up to collectionSize, multiplying by 10 each pass
Int32 maxUniques = 1;
Int32 collectionSize = 100000000;
Random rand = new Random();

while (maxUniques <= collectionSize)
{
    List<Int32> bigCollection = new List<Int32>();
    bigCollection.Capacity = collectionSize;

    for (Int32 count = 0; count < collectionSize; count++)
        bigCollection.Add(rand.Next(maxUniques));

    HashSet<Int32> uniqueSources = new HashSet<Int32>();

    Stopwatch watch = new Stopwatch();
    watch.Start();

    // Conditional add: check Contains() before calling Add()
    foreach (Int32 num in bigCollection)
    {
        if (!uniqueSources.Contains(num))
            uniqueSources.Add(num);
    }

    Console.WriteLine(String.Format("With {0,10:N0} unique values in a set of {1,10:N0} values, the time taken for conditional add: {2,6:N0} ms", uniqueSources.Count, collectionSize, watch.ElapsedMilliseconds));

    uniqueSources = new HashSet<Int32>();
    watch.Restart();

    // Simple add: rely on Add() to ignore duplicates
    foreach (Int32 num in bigCollection)
    {
        uniqueSources.Add(num);
    }

    Console.WriteLine(String.Format("With {0,10:N0} unique values in a set of {1,10:N0} values, the time taken for simple add:      {2,6:N0} ms", uniqueSources.Count, collectionSize, watch.ElapsedMilliseconds));
    Console.WriteLine();

    maxUniques *= 10;
}

Which gave the following output:

With 1 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,004 ms
With 1 unique values in a set of 100,000,000 values, the time taken for simple add: 2,540 ms

With 10 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,066 ms
With 10 unique values in a set of 100,000,000 values, the time taken for simple add: 2,391 ms

With 100 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,057 ms
With 100 unique values in a set of 100,000,000 values, the time taken for simple add: 2,410 ms

With 1,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,011 ms
With 1,000 unique values in a set of 100,000,000 values, the time taken for simple add: 2,459 ms

With 10,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 2,219 ms
With 10,000 unique values in a set of 100,000,000 values, the time taken for simple add: 2,414 ms

With 100,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 3,024 ms
With 100,000 unique values in a set of 100,000,000 values, the time taken for simple add: 3,124 ms

With 1,000,000 unique values in a set of 100,000,000 values, the time taken for conditional add: 8,937 ms
With 1,000,000 unique values in a set of 100,000,000 values, the time taken for simple add: 9,310 ms

With 9,999,536 unique values in a set of 100,000,000 values, the time taken for conditional add: 11,798 ms
With 9,999,536 unique values in a set of 100,000,000 values, the time taken for simple add: 11,660 ms

With 63,199,938 unique values in a set of 100,000,000 values, the time taken for conditional add: 20,847 ms
With 63,199,938 unique values in a set of 100,000,000 values, the time taken for simple add: 20,213 ms

This is curious to me.

Up to 1% unique values, it is faster to call Contains() first rather than just keep hitting Add(). At 10% and 63% unique, it was faster to just Add().

To put it another way: the conditional version always does 100 million Contains() calls, and what it buys is skipping the Add() calls for values that are already present. Cancelling the Add() calls both versions make for genuinely new values:

100 million Contains() is faster than 99 million Add()
100 million Contains() is slower than 90 million Add()

I adjusted the code to try 1 million to 10 million unique values in 1 million increments and found the inflection point is somewhere around 7-10%, though the results weren't conclusive.
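
The adjustment itself was just the outer loop; something like this sketch (not the exact code I ran):

// Step maxUniques from 1 million to 10 million in 1 million increments,
// keeping the rest of the benchmark body unchanged
for (Int32 maxUniques = 1000000; maxUniques <= 10000000; maxUniques += 1000000)
{
    // ... build bigCollection with rand.Next(maxUniques), then time
    // the conditional-add pass and the simple-add pass as above
}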

So if you're expecting less than 7% of values to be added, it's faster to call Contains() first. More than 7%, just call Add().

Upvotes: 1

Scott Chamberlain

Reputation: 127603

Your test is a bad test, for the reasons Jon Hanna stated, and did not give you accurate results. When you call Add, HashSet internally calls AddIfNotPresent, and the first thing AddIfNotPresent does is check whether the object already exists (code obtained from ILSpy):

public bool Add(T item)
{
    return this.AddIfNotPresent(item);
}

private bool AddIfNotPresent(T value)
{
    if (this.m_buckets == null)
    {
        this.Initialize(0);
    }
    int num = this.InternalGetHashCode(value);
    int num2 = num % this.m_buckets.Length;
    int num3 = 0;
    for (int i = this.m_buckets[num % this.m_buckets.Length] - 1; i >= 0; i = this.m_slots[i].next)
    {
        if (this.m_slots[i].hashCode == num && this.m_comparer.Equals(this.m_slots[i].value, value))
        {
            return false;
        }
        num3++;
    }
    //(Snip)

So by doing Contains then Add, you check whether the object exists twice. If the bucket being probed holds many items, that duplicated check could add up to a significant performance loss.
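
Note that Add returns the result of AddIfNotPresent, so it already tells you whether the value was newly inserted. If you need to react to first occurrences, use that return value instead of a separate Contains call; a minimal sketch (sources is just a stand-in for whatever collection you iterate):

HashSet<String> uniqueSources = new HashSet<String>();

foreach (String source in sources)
{
    // Add returns true only when the value was not already present,
    // so this is one lookup instead of Contains followed by Add
    if (uniqueSources.Add(source))
    {
        // first time this value has been seen
    }
}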

Upvotes: 2

TomDestry

Reputation: 3409

As I was typing the question, it occurred to me that someone would ask why I didn't just test it myself. So I tested it myself.

I created a collection with 1.26 million records and 21 unique source strings and ran it through the following code:

HashSet<String> uniqueSources = new HashSet<String>();

Stopwatch watch = new Stopwatch();
watch.Start();

// Simple add: rely on Add() to ignore duplicates
foreach (LoggingMessage mess in bigCollection)
{
    uniqueSources.Add(mess.Source);
}

Console.WriteLine(String.Format("Time taken for simple add: {0}ms", watch.ElapsedMilliseconds));

uniqueSources.Clear();
watch.Restart();

// Conditional add: check Contains() before calling Add()
foreach (LoggingMessage mess in bigCollection)
{
    if (!uniqueSources.Contains(mess.Source))
        uniqueSources.Add(mess.Source);
}

Console.WriteLine(String.Format("Time taken for conditional add: {0}ms", watch.ElapsedMilliseconds));

The results were:

Time taken for simple add: 147ms

Time taken for conditional add: 125ms

So for my data at least, checking for existence doesn't slow things down; it is actually slightly faster. The difference is pretty small either way, though.
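
As an aside, if you don't need to react per item, the set can also be filled in one call. This sketch uses the standard UnionWith method plus LINQ's Select (requires using System.Linq); internally it still adds item by item, so I wouldn't expect it to change the timings:

HashSet<String> uniqueSources = new HashSet<String>();
uniqueSources.UnionWith(bigCollection.Select(mess => mess.Source));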

Upvotes: -1
