Reputation: 1396
At the moment I have a list
of 1million integers
, and I check each integer
against a blacklist of 2000 integer
s. This is taking about 2 minutes.
for(int i = 0; i< MillionIntegerList.Length ; i++)
{
for(int blacklisted = 0; blacklisted < TwoThousandIntegerList.Length ; blacklisted++)
if(i==blacklisted)
i = 0; //Zero is a sentinel value
}
This makes 2,000,000,000 iterations(loops) altogether. Is there a better way Im not seeing? thanks
Upvotes: 29
Views: 3411
Reputation: 1495
How about doing a binary search on the longer list, since it's sorted.
foreach(integer blacklisted in TwoThousandIntegerList)
{
integer i = MillionIntegerList.binarySearch(blacklisted)
if(i==blacklisted){
//Do your stuff
}
}
This solution only costs O(m log n) time, where m is the size of the small list and n is the size of the longer list. Caveat: This solution assumes the MillionIntegerList
has no duplicates values.
If that isn't the case, then you can just iterate though the repeats since they must lie in a contiguous block. For this, I'm going to assume that the MillionInterList
is a list of records, each with a value
and an index
.
foreach(integer blacklisted in TwoThousandIntegerList)
{
integer index = MillionIntegerList.binarySearch(blacklisted)
//Find the index of the first occurrence of blacklisted value
while(index > 0 && MillionIntegerList[index - 1].value == blacklisted){
--index;
}
while(MillionIntegerList[index].value == blacklisted){
//Do your stuff
++index;
}
}
This solution costs O(m log n + mk) where k is the average number of duplicates per blacklisted integer found in MillionInterList
.
Upvotes: 1
Reputation: 6379
I think the best solution is to use a Bloom filter and it case the Bloom filter says an element may be in the blacklist, just check if is not a false positive (which can be done in O(Log(n)) if the blacklist is sorted). This solution is time efficient and uses almost no additional space which makes it far better than using a hashset.
This is the solution Google uses for the blacklist in Chrome.
Upvotes: 1
Reputation: 1500923
Three options now - the first two are more general, in that they don't rely on MillionIntegerList
being sorted (which wasn't originally specified). The third is preferable in the case where the large list is already sorted.
Option 1
Yes, there's definitely a better way of doing it, using LINQ:
var common = MillionIntegerList.Intersect(TwoThousandIntegerList).ToList();
That will internally use a HashSet<int>
built via the TwoThousandIntegerList
, then look up each element of MillionIntegerList
within it - which will be much more efficient than going through the whole of TwoThousandIntegerList
each time.
If you only want the non-blacklisted ones, you need:
var valid = MillionIntegerList.Except(TwoThousandIntegerList).ToList();
Note that if you only need to iterate over the results once, you should remove the ToList
call - I've included it to materialize the results so they can be examined multiple times cheaply. If you're just iterating, the return value of Intersect
or Except
will just stream the results, making it much cheaper in terms of memory usage.
Option 2
If you don't want to rely on the implementation details of LINQ to Objects but you still want a hash-based approach:
var hashSet = new HashSet<int>(TwoThousandIntegerList);
hashSet.IntersectWith(MillionIntegerList);
// Now use hashSet
Option 3
The approach of using the fact that the large list is sorted would definitely be useful.
Assuming you don't mind sorting the blacklisted list first as well, you could write a streaming (and general purpose) implementation like this (untested):
// Note: to use this, you'd need to make sure that *both* sequences are sorted.
// You could either sort TwoThousandIntegerList in place, or use LINQ's OrderBy
// method.
public IEnumerable<T> SortedIntersect<T>(this IEnumerable<T> first,
IEnumerable<T> second) where T : IComparable<T>
{
using (var firstIterator = first.GetEnumerator())
{
if (!firstIterator.MoveNext())
{
yield break;
}
using (var secondIterator = second.GetEnumerator())
{
if (!secondIterator.MoveNext())
{
yield break;
}
T firstValue = firstIterator.Current;
T secondValue = secondIterator.Current;
while (true)
{
int comparison = firstValue.CompareTo(secondValue);
if (comparison == 0) // firstValue == secondValue
{
yield return firstValue;
}
else if (comparison < 0) // firstValue < secondValue
{
if (!firstIterator.MoveNext())
{
yield break;
}
firstValue = firstIterator.Current;
}
else // firstValue > secondValue
{
if (!secondIterator.MoveNext())
{
yield break;
}
secondValue = secondIterator.Current;
}
}
}
}
}
(You could take an IComparer<T>
if you wanted instead of relying on T being comparable.)
Upvotes: 50
Reputation: 706
Your approach requires O(n*n) time. Consider these optimizations:
1)
If your integers are not too large, you can use array of bool (for example if the biggest possible integer is 1000000 use bool[] b = new bool[1000000]). Now to add a number K to blacklist use b[K] = true. Check is trivial. This works in O(n). You can also use BitArray
2)
Integers can be big enough. Use binary search tree for storing blacklist (for example SortedSet). it has O(logN) insert and retrieve time. So in all it is O(N*logN). The syntax is just the same as for List (Add(int K), Contains(int K)), duplicates are ignored
Upvotes: 3
Reputation: 34187
Since the large list is sorted. You might get the best results by sorting the small list (very quick) and then doing a linear merge. You'll only need to look at each item in the large (and small) list once, and there will be no need for a Hashtable to be created in the background.
See the merge function part of MergeSort for an idea on how to do this.
Upvotes: 17
Reputation: 324
What you need is Enumerable.Except Method (IEnumerable, IEnumerable) in my opinion
check here http://msdn.microsoft.com/en-us/library/bb300779.aspx
Upvotes: 5
Reputation: 7334
Use a HashSet for blocked list.
foreach(integer i in MillionIntegerList)
{
//check if blockedlist contains i
//do what ever you like.
}
Upvotes: 0