thatandrey
thatandrey

Reputation: 287

Binary search slower, what am I doing wrong?

EDIT: so it looks like this is normal behavior, so can anyone just recommend a faster way to do these numerous intersections?

so my problem is this. I have 8000 lists (strings in each list). For each list (ranging from size 50 to 400), I'm comparing it to every other list and performing a calculation based on the intersection number. So I'll do

list1(intersect)list1= number

list1(intersect)list2= number

list1(intersect)list888= number

And I do this for every list. Previously, I had HashList and my code was essentially this: (well, I was actually searching through properties of an object, so I had to modify the code a bit, but it's basically this:

I have my two versions below, but if anyone knows anything faster, please let me know!

Loop through AllLists, getting each list, starting with list1, and then do this:

foreach (List list in AllLists)
{
    if (list1_length < list_length) //just a check to so I'm looping through the                  
                                    //smaller list
    {
        foreach (string word in list1)
        {
            if (block.generator_list.Contains(word))
            {
                //simple integer count
            }
        }
    }
// a little more code, but the same, but looping through the other list if it's smaller/bigger

Then I make the lists into regular lists, and applied Sort(), which changed my code to

foreach (List list in AllLists)
{
    if (list1_length < list_length) //just a check to so I'm looping through the                  
                                    //smaller list
    {
        for (int i = 0; i < list1_length; i++)
        {
            var test = list.BinarySearch(list1[i]);
            if (test > -1)
            {
                //simple integer count
            }
        }
    }

The first version takes about 6 seconds, the other one takes more than 20 (I just stop there cuz otherwise it would take more than a minute!!!) (and this is for a smallish subset of the data)

I'm sure there's a drastic mistake somewhere, but I can't find it.

Upvotes: 0

Views: 292

Answers (2)

Assaf
Assaf

Reputation: 1370

Well I have tried three distinct methods for achieving this (assuming I understood the problem correctly). Please note I have used HashSet<int> in order to more easily generate random input. setting up:

List<HashSet<int>> allSets = new List<HashSet<int>>();
Random rand = new Random();
for(int i = 0; i < 8000; ++i) {
    HashSet<int> ints = new HashSet<int>();
    for(int j = 0; j < rand.Next(50, 400); ++j) {
        ints.Add(rand.Next(0, 1000));
    }
    allSets.Add(ints);
}

the three methods I checked (code is what runs in the inner loop):

the loop:

note that you are getting duplicated results in your code (intersecting set A with set B and later intersecting set B with set A). It won't affect your performance thanks to the list length check you are doing. But iterating this way is clearer.

for(int i = 0; i < allSets.Count; ++i) {
    for(int j = i + 1; j < allSets.Count; ++j) {

    }
}

first method:

used IEnumerable.Intersect() to get the intersection with the other list and checked IEnumerable.Count() to get the size of the intersection.

var intersect = allSets[i].Intersect(allSets[j]);
count = intersect.Count();

this was the slowest one averaging 177s

second method:

cloned the smaller set of the two sets I was intersecting, then used ISet.IntersectWith() and checked the resulting sets Count.

HashSet<int> intersect;
HashSet<int> intersectWith;
        if(allSets[i].Count < allSets[j].Count) {
            intersect = new HashSet<int>(allSets[i]);
            intersectWith = allSets[j];
        } else {
            intersect = new HashSet<int>(allSets[j]);
            intersectWith = allSets[i];
        }
        intersect.IntersectWith(intersectWith);
        count = intersect.Count;
    }
}

this one was slightly faster, averaging 154s

third method:

did something very similar to what you did iterated over the shorter set and checked ISet.Contains on the longer set.

for(int i = 0; i < allSets.Count; ++i) {
    for(int j = i + 1; j < allSets.Count; ++j) {
        count = 0;
        if(allSets[i].Count < allSets[j].Count) {
            loopingSet = allSets[i];
            containsSet = allSets[j];
        } else {
            loopingSet = allSets[j];
            containsSet = allSets[i];
        }
        foreach(int k in loopingSet) {
            if(containsSet.Contains(k)) {
                ++count;
            }
        }
    }
}

this method was by far the fastest (as expected), averaging 66s

conclusion

the method you're using is the fastest of these three. I certainly can't think of a faster single threaded way to do this. Perhaps there is a better concurrent solution.

Upvotes: 1

DvS
DvS

Reputation: 1095

I've found that one of the most important considerations in iterating/searching any kind of collection is to choose the collection type very carefully. To iterate through a normal collection for your purposes will not be the most optimal. Try using something like:

System.Collections.Generic.HashSet<T>

Using the Contains() method while iterating over the shorter list of two (as you mentioned you're already doing) should give close to O(1) performance, the same as key lookups in the generic Dictionary type.

Upvotes: 0

Related Questions