theUser
theUser

Reputation: 1396

What is the Most Efficient way to compare large List of integers to smaller List of integers?

At the moment I have a list of 1million integers, and I check each integer against a blacklist of 2000 integers. This is taking about 2 minutes.

for(int i = 0; i< MillionIntegerList.Length ; i++)
{
    for(int blacklisted = 0; blacklisted < TwoThousandIntegerList.Length ; blacklisted++)
        if(i==blacklisted)
            i = 0; //Zero is a sentinel value 
}

This makes 2,000,000,000 iterations(loops) altogether. Is there a better way Im not seeing? thanks

Upvotes: 29

Views: 3411

Answers (8)

rrufai
rrufai

Reputation: 1495

How about doing a binary search on the longer list, since it's sorted.

foreach(integer blacklisted in TwoThousandIntegerList)
{
    integer i  = MillionIntegerList.binarySearch(blacklisted)
    if(i==blacklisted){
          //Do your stuff
    } 
}

This solution only costs O(m log n) time, where m is the size of the small list and n is the size of the longer list. Caveat: This solution assumes the MillionIntegerList has no duplicates values.

If that isn't the case, then you can just iterate though the repeats since they must lie in a contiguous block. For this, I'm going to assume that the MillionInterList is a list of records, each with a value and an index.

foreach(integer blacklisted in TwoThousandIntegerList)
{
    integer index = MillionIntegerList.binarySearch(blacklisted)
    //Find the index of the first occurrence of blacklisted value
    while(index > 0 && MillionIntegerList[index - 1].value == blacklisted){
          --index;
    }
    while(MillionIntegerList[index].value == blacklisted){
          //Do your stuff
          ++index;
    } 
}

This solution costs O(m log n + mk) where k is the average number of duplicates per blacklisted integer found in MillionInterList.

Upvotes: 1

Thomash
Thomash

Reputation: 6379

I think the best solution is to use a Bloom filter and it case the Bloom filter says an element may be in the blacklist, just check if is not a false positive (which can be done in O(Log(n)) if the blacklist is sorted). This solution is time efficient and uses almost no additional space which makes it far better than using a hashset.

This is the solution Google uses for the blacklist in Chrome.

Upvotes: 1

Jon Skeet
Jon Skeet

Reputation: 1500923

Three options now - the first two are more general, in that they don't rely on MillionIntegerList being sorted (which wasn't originally specified). The third is preferable in the case where the large list is already sorted.

Option 1

Yes, there's definitely a better way of doing it, using LINQ:

var common = MillionIntegerList.Intersect(TwoThousandIntegerList).ToList();

That will internally use a HashSet<int> built via the TwoThousandIntegerList, then look up each element of MillionIntegerList within it - which will be much more efficient than going through the whole of TwoThousandIntegerList each time.

If you only want the non-blacklisted ones, you need:

var valid = MillionIntegerList.Except(TwoThousandIntegerList).ToList();

Note that if you only need to iterate over the results once, you should remove the ToList call - I've included it to materialize the results so they can be examined multiple times cheaply. If you're just iterating, the return value of Intersect or Except will just stream the results, making it much cheaper in terms of memory usage.

Option 2

If you don't want to rely on the implementation details of LINQ to Objects but you still want a hash-based approach:

var hashSet = new HashSet<int>(TwoThousandIntegerList);
hashSet.IntersectWith(MillionIntegerList);
// Now use hashSet

Option 3

The approach of using the fact that the large list is sorted would definitely be useful.

Assuming you don't mind sorting the blacklisted list first as well, you could write a streaming (and general purpose) implementation like this (untested):

// Note: to use this, you'd need to make sure that *both* sequences are sorted.
// You could either sort TwoThousandIntegerList in place, or use LINQ's OrderBy
// method.

public IEnumerable<T> SortedIntersect<T>(this IEnumerable<T> first,
    IEnumerable<T> second) where T : IComparable<T>
{
    using (var firstIterator = first.GetEnumerator())
    {
        if (!firstIterator.MoveNext())
        {
            yield break;
        }

        using (var secondIterator = second.GetEnumerator())
        {
            if (!secondIterator.MoveNext())
            {
                yield break;
            }
            T firstValue = firstIterator.Current;
            T secondValue = secondIterator.Current;

            while (true)
            {
                int comparison = firstValue.CompareTo(secondValue);
                if (comparison == 0) // firstValue == secondValue
                {
                    yield return firstValue;
                }
                else if (comparison < 0) // firstValue < secondValue
                {
                    if (!firstIterator.MoveNext())
                    {
                        yield break;
                    }
                    firstValue = firstIterator.Current;
                }
                else // firstValue > secondValue
                {
                    if (!secondIterator.MoveNext())
                    {
                        yield break;
                    }
                    secondValue = secondIterator.Current;
                }  
            }                
        }
    }
}

(You could take an IComparer<T> if you wanted instead of relying on T being comparable.)

Upvotes: 50

Oleg Golovkov
Oleg Golovkov

Reputation: 706

Your approach requires O(n*n) time. Consider these optimizations:

  • 1)

    If your integers are not too large, you can use array of bool (for example if the biggest possible integer is 1000000 use bool[] b = new bool[1000000]). Now to add a number K to blacklist use b[K] = true. Check is trivial. This works in O(n). You can also use BitArray

  • 2)

    Integers can be big enough. Use binary search tree for storing blacklist (for example SortedSet). it has O(logN) insert and retrieve time. So in all it is O(N*logN). The syntax is just the same as for List (Add(int K), Contains(int K)), duplicates are ignored

Upvotes: 3

Daniel Renshaw
Daniel Renshaw

Reputation: 34187

Since the large list is sorted. You might get the best results by sorting the small list (very quick) and then doing a linear merge. You'll only need to look at each item in the large (and small) list once, and there will be no need for a Hashtable to be created in the background.

See the merge function part of MergeSort for an idea on how to do this.

Upvotes: 17

Kishore Kumar
Kishore Kumar

Reputation: 12874

use Except method for List. This will work

Upvotes: -2

isioutis
isioutis

Reputation: 324

What you need is Enumerable.Except Method (IEnumerable, IEnumerable) in my opinion

check here http://msdn.microsoft.com/en-us/library/bb300779.aspx

Upvotes: 5

Sandeep
Sandeep

Reputation: 7334

Use a HashSet for blocked list.

foreach(integer i in MillionIntegerList)
{
        //check if blockedlist contains i
        //do what ever you like. 
}

Upvotes: 0

Related Questions