Reputation: 2109
I have a HashSet<int> and a List<int> (the HashSet has approximately 3 million items, the List approximately 300k items). I currently intersect them using
var intersected = hashset.Intersect(list).ToArray();
and I wonder if there is any faster way to do so. Maybe in parallel?
Upvotes: 9
Views: 6613
Reputation: 5213
HashSet has a method IntersectWith that is optimized for the case where the intersection is performed between two hash sets. Using the IntersectWith method, we can intersect a HashSet and a List with the following approach:
private static IEnumerable<int> Intersect(HashSet<int> hash, List<int> list)
{
HashSet<int> intersect = new HashSet<int>(list);
intersect.IntersectWith(hash);
return intersect;
}
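As a quick illustration, the helper above could be exercised like this (the small sample values are mine, not from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    // The answer's approach: copy the list into a HashSet, then use the
    // optimized set-to-set IntersectWith.
    private static IEnumerable<int> Intersect(HashSet<int> hash, List<int> list)
    {
        HashSet<int> intersect = new HashSet<int>(list);
        intersect.IntersectWith(hash);
        return intersect;
    }

    static void Main()
    {
        var hash = new HashSet<int> { 1, 2, 3, 4, 5 };
        var list = new List<int> { 3, 4, 5, 6, 7 };
        // Sorted only to make the printed order deterministic.
        Console.WriteLine(string.Join(",", Intersect(hash, list).OrderBy(x => x)));
        // prints 3,4,5
    }
}
```

Note that the result is a set, so any duplicates in the list are dropped, and the list's original order is not preserved.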
I have measured (using Stopwatch) the performance of your original method (Linq Intersect), of the methods proposed by @TheodorZoulias (HashSet Contains and HashSet Contains Parallel), and of my method (HashSet IntersectWith). Here are the results:
------------------------------------------------------------------------
| Method | Min, ms | Max, ms | Avg, ms | StdDev, ms |
------------------------------------------------------------------------
| Linq Intersect | 135 | 274 | 150 | 17 |
| HashSet Contains | 25 | 44 | 26 | 2 |
| HashSet Contains Parallel | 12 | 53 | 13 | 3 |
| HashSet IntersectWith | 57 | 89 | 61 | 4 |
------------------------------------------------------------------------
From the table we can see that the fastest method is HashSet Contains Parallel and the slowest is Linq Intersect.
Here is complete source code that was used to measure performance.
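The full benchmark is not reproduced here; a minimal Stopwatch loop in the same spirit might look like the following (the data sizes follow the question, but the iteration count and overlap ratio are my assumptions):

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class Benchmark
{
    static void Main()
    {
        // Sizes from the question: ~3 million items in the set, ~300k in the list.
        var hashset = new HashSet<int>(Enumerable.Range(0, 3_000_000));
        var list = Enumerable.Range(0, 3_000_000)
            .Where(x => x % 10 == 0).ToList(); // 300k items, all present in the set

        var timings = new List<long>();
        for (int i = 0; i < 5; i++)
        {
            var sw = Stopwatch.StartNew();
            var intersected = hashset.Intersect(list).ToArray(); // method under test
            sw.Stop();
            timings.Add(sw.ElapsedMilliseconds);
        }

        Console.WriteLine($"Min: {timings.Min()} ms, " +
                          $"Max: {timings.Max()} ms, " +
                          $"Avg: {timings.Average():F0} ms");
    }
}
```

Swapping in the other three methods at the marked line gives the comparison shown in the table; absolute numbers will of course vary by machine.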
Upvotes: 8
Reputation: 43525
Yes, you can go faster, because you already have a HashSet in hand. The LINQ Intersect uses a generic algorithm that essentially recreates a HashSet from scratch every time it's called. Here is a faster algorithm:
/// <summary>Yields all the elements of first (including duplicates) that also
/// appear in second, in the order in which they appear in first.</summary>
public static IEnumerable<TSource> Intersect<TSource>(IEnumerable<TSource> first,
HashSet<TSource> second)
{
foreach (TSource element in first)
{
if (second.Contains(element)) yield return element;
}
}
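For example, calling it with the list as first and the hash set as second keeps the list's duplicates and ordering, as the doc comment promises (the sample values are illustrative):

```csharp
using System;
using System.Collections.Generic;

class Demo
{
    // The streaming helper from the answer: one pass over `first`,
    // with O(1) membership tests against the pre-built set.
    public static IEnumerable<TSource> Intersect<TSource>(IEnumerable<TSource> first,
        HashSet<TSource> second)
    {
        foreach (TSource element in first)
        {
            if (second.Contains(element)) yield return element;
        }
    }

    static void Main()
    {
        var hashset = new HashSet<int> { 2, 4, 6, 8 };
        var list = new List<int> { 1, 2, 2, 3, 4 };
        // Duplicates in `first` are kept, and the list's order is preserved.
        Console.WriteLine(string.Join(",", Intersect(list, hashset)));
        // prints 2,2,4
    }
}
```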
Update: Here is a parallel version of the above idea:
var intersected = list.AsParallel().Where(x => hashset.Contains(x)).ToArray();
I wouldn't expect it to be much faster, if at all, because the workload is too granular. The overhead of calling a lambda 300,000 times will probably overshadow any benefits of the parallelism.
Also, the order of the results will not be preserved unless the AsOrdered PLINQ method is added to the query, which further hurts the performance of the operation.
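For completeness, a sketch of the ordered variant (sample values are mine; AsOrdered preserves the source order at extra cost):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static void Main()
    {
        var hashset = new HashSet<int> { 2, 4, 6 };
        var list = new List<int> { 6, 4, 2, 1 };

        // AsOrdered forces the output to follow the order of `list`.
        var intersected = list.AsParallel().AsOrdered()
            .Where(x => hashset.Contains(x)).ToArray();

        Console.WriteLine(string.Join(",", intersected));
        // prints 6,4,2
    }
}
```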
Upvotes: 1
Reputation: 32878
It might be faster for you to store lots of integers as a compact bit set rather than as a HashSet
or List
(at least if you're using List
to store unique integers just like HashSet
). In this sense, there are several choices:
BitArray stores each bit in a compact way. As an example, if you're storing integers from 1 through 65000, a BitArray requires about 8125 bytes of memory (as opposed to 65000 bytes if each bit were stored as an 8-bit byte). However, a BitArray may not be very memory-efficient if the highest set bit is very large (e.g., 3 billion), or if the set of bits is sparse (there are huge areas consisting entirely of set bits and/or clear bits). You can intersect two BitArrays using the And method. There are also alternative bit set implementations (such as FixedBitSet) that have been compared in terms of performance and memory (note that those comparisons are of Java implementations, but they may still be useful in the .NET case).
Upvotes: 0
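As a sketch of the BitArray idea from the last answer (the sample values and sizes are mine): two BitArrays can be intersected with And, which stores the result in the first operand.

```csharp
using System;
using System.Collections;

class Demo
{
    static void Main()
    {
        // Two bit sets over the range 0..9; a set bit means "the integer is present".
        var a = new BitArray(10);
        var b = new BitArray(10);
        foreach (int i in new[] { 1, 3, 5, 7 }) a[i] = true;
        foreach (int i in new[] { 3, 4, 5, 6 }) b[i] = true;

        a.And(b); // intersection, stored in `a`

        for (int i = 0; i < a.Length; i++)
            if (a[i]) Console.Write(i + " ");
        Console.WriteLine();
        // prints 3 5
    }
}
```

The And call works word-at-a-time internally, which is what makes the bit-set representation attractive for dense integer sets.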