user2033412
user2033412

Reputation: 2109

Fast intersection of HashSet<int> and List<int>

I have a HashSet<int> and a List<int> (Hashset has approximately 3 Million items, List has approximately 300k items).

I currently intersect them using

var intersected = hashset.Intersect(list).ToArray();

and I wonder if there is any faster way to do so. Maybe in parallel?

Upvotes: 9

Views: 6613

Answers (3)

Iliar Turdushev
Iliar Turdushev

Reputation: 5213

HashSet has a method IntersectWith that is optimized if intersection is performed between two hash sets. Using method IntersectWith we can intersect HashSet and List using the next approach:

private static IEnumerable<int> Intersect(HashSet<int> hash, List<int> list)
{
    HashSet<int> intersect = new HashSet<int>(list);
    intersect.IntersectWith(hash);
    return intersect;
}

I have measured (using Stopwatch) performance of your original method (Linq Intersect), methods proposed by @TheodorZoulias (HashSet Contains and HashSet Contains Parallel) and my method (HashSet IntersectWith). Here are results:

------------------------------------------------------------------------
|         Method            | Min, ms | Max, ms | Avg, ms | StdDev, ms |
------------------------------------------------------------------------
| Linq Intersect            |   135   |   274   |   150   |     17     |
| HashSet Contains          |    25   |    44   |    26   |      2     |
| HashSet Contains Parallel |    12   |    53   |    13   |      3     |
| HashSet IntersectWith     |    57   |    89   |    61   |      4     |
------------------------------------------------------------------------

From the table we can see that the fastest method is HashSet Contains Parallel and the slowest is Linq Intersect.


Here is complete source code that was used to measure performance.

Upvotes: 8

Theodor Zoulias
Theodor Zoulias

Reputation: 43525

Yes, you can go faster because you have already a HashSet in hand. The LINQ Intersect uses a generic algorithm, that essentially recreates a HashSet from scratch every time it's called. Here is a faster algorithm:

/// <summary>Yields all the elements of first (including duplicates) that also
/// appear in second, in the order in which they appear in first.</summary>
public static IEnumerable<TSource> Intersect<TSource>(IEnumerable<TSource> first,
    HashSet<TSource> second)
{
    foreach (TSource element in first)
    {
        if (second.Contains(element)) yield return element;
    }
}

Update: Here is a parallel version of the above idea:

var intersected = list.AsParallel().Where(x => hashset.Contains(x)).ToArray();

I wouldn't expect it to be much faster, if at all, because the workload is too granular. The overhead of calling a lambda 300,000 times will probably overshadow any benefits of the parallelism.

Also the order of the results will not be preserved, unless the AsOrdered PLINQ method is added in the query, hurting further the performance of the operation.

Upvotes: 1

Peter O.
Peter O.

Reputation: 32878

It might be faster for you to store lots of integers as a compact bit set rather than as a HashSet or List (at least if you're using List to store unique integers just like HashSet). In this sense, there are several choices:

  • The built-in BitArray stores each bit in a compact way. As an example, if you're storing integers from 1 through 65000, BitArray requires about 8125 bytes of memory (as opposed to 65000 bytes if each bit were stored as an 8-bit byte). However, BitArray may not be very memory-efficient if the highest set bit is very large (e.g., 3 billion), or if the set of bits is sparse (there are huge areas with set bits and/or clear bits). You can intersect two BitArrays using the Xor method
  • Compressed bit sets likewise store each bit in a compact way, but also compress parts of themselves to further save memory while still keeping set operations such as intersection efficient. Examples include Elias-Fano encoding, Roaring Bitmaps, and EWAH. See graphs comparing different implementations of compressed bit sets with uncompressed (FixedBitSet) in terms of performance and memory (note that they compare Java implementations, but they may still be useful in the .NET case).

Upvotes: 0

Related Questions