Alex

Reputation: 5257

Binary search with comparer is faster than without

I have a dataset that consists of about 2 million records. I am trying to find the single record that is closest to a given timestamp. The list is ordered by time, and each record is represented by the following class:

    public class DataPoint
    {
        public long OpenTimeTs;
    }

I have implemented 3 methods that do the same job and produce the same results, and I have some questions about why one of the approaches performs faster than the others.

Method 1

uses a binary search on the list of long values:

        private DataPoint BinaryFindClosest(List<DataPoint> candles, List<long> times, long dateToFindMs)
        {
            int index = times.BinarySearch(dateToFindMs);

            if (index >= 0)
                return candles[index];

            // If not found, List.BinarySearch returns the complement 
            // of the index where the element should have been.
            index = ~index;

            // The date searched for is larger than any in the list.
            if (index == times.Count)
                return candles[index - 1];

            // The date searched is smaller than any in the list.
            if (index == 0)
                return candles[0];

            if (Math.Abs(dateToFindMs - times[index - 1]) < Math.Abs(dateToFindMs - times[index]))
                return candles[index - 1];
            else
                return candles[index];
        }

Method 2

Almost the same as method 1, except that it uses a custom object comparer.

        private DataPoint BinaryFindClosest2(List<DataPoint> candles, DataPoint toFind)
        {
            var comparer = Comparer<DataPoint>.Create((x, y) => x.OpenTimeTs > y.OpenTimeTs ? 1 : x.OpenTimeTs < y.OpenTimeTs ? -1 : 0);

            int index = candles.BinarySearch(toFind, comparer);

            if (index >= 0)
                return candles[index];

            // If not found, List.BinarySearch returns the complement 
            // of the index where the element should have been.
            index = ~index;

            // The date searched for is larger than any in the list.
            if (index == candles.Count)
                return candles[index - 1];

            // The date searched is smaller than any in the list.
            if (index == 0)
                return candles[0];

            if (Math.Abs(toFind.OpenTimeTs - candles[index - 1].OpenTimeTs) < Math.Abs(toFind.OpenTimeTs - candles[index].OpenTimeTs))
                return candles[index - 1];
            else
                return candles[index];
        }
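
As an aside (not part of the comparison itself): Comparer<DataPoint>.Create allocates a new comparer object on every call. A minimal sketch of hoisting it into a static field, so it is created only once, would be:

        private static readonly Comparer<DataPoint> OpenTimeComparer =
            Comparer<DataPoint>.Create((x, y) => x.OpenTimeTs.CompareTo(y.OpenTimeTs));

        // BinaryFindClosest2 would then call candles.BinarySearch(toFind, OpenTimeComparer)
        // instead of creating the comparer on each invocation.

This likely accounts for the small per-call allocation reported in the benchmark results further down.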

Method 3

Finally, this is the method I had been using before discovering the BinarySearch approach in another Stack Overflow topic.

        private DataPoint FindClosest(List<DataPoint> candles, DataPoint toFind)
        {
            long timeToFind = toFind.OpenTimeTs;

            int smallestDistanceIdx = -1;
            long smallestDistance = long.MaxValue;

            for (int i = 0; i < candles.Count; i++)
            {
                var candle = candles[i];
                var distance = Math.Abs(candle.OpenTimeTs - timeToFind);
                if (distance <= smallestDistance)
                {
                    smallestDistance = distance;
                    smallestDistanceIdx = i;
                }
            else
            {
                // The list is ordered by time, so once the distance starts
                // growing we have passed the closest element.
                break;
            }
            }

            return candles[smallestDistanceIdx];
        }

Question

Now here comes the problem. After running some benchmarks, it came to my attention that the second method (which uses the custom comparer) was the fastest of the three.

I would like to know why the approach with the custom comparer performs faster than the approach that binary-searches the list of longs.

I am using the following code to test the methods:

            var candles = AppState.GetLoadSymbolData();
            var times = candles.Select(s => s.OpenTimeTs).ToList();

            var dateToFindMs = candles[candles.Count / 2].OpenTimeTs;
            var candleToFind = new DataPoint() { OpenTimeTs = dateToFindMs };

            var numberOfFinds = 100_000;

            var sw = Stopwatch.StartNew();
            for (int i = 0; i < numberOfFinds; i++)
            {
                var foundCandle = BinaryFindClosest(candles, times, dateToFindMs);
            }
            sw.Stop();
            var elapsed1 = sw.ElapsedMilliseconds;

            sw.Restart();
            for (int i = 0; i < numberOfFinds; i++)
            {
                var foundCandle = BinaryFindClosest2(candles, candleToFind);
            }
            sw.Stop();
            var elapsed2 = sw.ElapsedMilliseconds;
            
            sw.Restart();
            for (int i = 0; i < numberOfFinds; i++)
            {
                var foundCandle = FindClosest(candles, candleToFind);
            }
            sw.Stop();
            var elapsed3 = sw.ElapsedMilliseconds;

            Console.WriteLine($"Elapsed 1: {elapsed1} ms");
            Console.WriteLine($"Elapsed 2: {elapsed2} ms");
            Console.WriteLine($"Elapsed 3: {elapsed3} ms");

In release mode, the results are as follows:

Logically, I would assume that it should be faster to compare the list of longs, but this is not the case. I tried profiling the code, but it only points to the BinarySearch method's slow execution, nothing else. So there must be some internal process that slows things down for longs.

Edit: After following the advice, I have implemented a proper benchmark test using BenchmarkDotNet, and here are the results (a sketch of the harness follows the table):

    | Method             | N     |            Mean |         Error |        StdDev |   Gen0 | Allocated |
    |------------------- |------ |----------------:|--------------:|--------------:|-------:|----------:|
    | BinaryFindClosest  | 10000 |        28.31 ns |      0.409 ns |      0.362 ns |      - |         - |
    | BinaryFindClosest2 | 10000 |        75.85 ns |      0.865 ns |      0.722 ns | 0.0014 |      24 B |
    | FindClosest        | 10000 | 3,363,223.68 ns | 63,300.072 ns | 52,858.427 ns |      - |       2 B |
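
For reference, here is a minimal sketch of the kind of BenchmarkDotNet harness that produces a table like the one above. The synthetic data setup and the SearchMethods wrapper class are assumptions for illustration; the real benchmark loads the actual candle data:

    using System.Collections.Generic;
    using System.Linq;
    using BenchmarkDotNet.Attributes;
    using BenchmarkDotNet.Running;

    [MemoryDiagnoser]
    public class ClosestSearchBenchmarks
    {
        [Params(10_000)]
        public int N;

        private List<DataPoint> candles;
        private List<long> times;
        private DataPoint candleToFind;
        private long dateToFindMs;

        [GlobalSetup]
        public void Setup()
        {
            // Synthetic, time-ordered data stands in for the real candles here.
            candles = Enumerable.Range(0, N)
                .Select(i => new DataPoint { OpenTimeTs = i * 60_000L })
                .ToList();
            times = candles.Select(c => c.OpenTimeTs).ToList();
            dateToFindMs = candles[N / 2].OpenTimeTs;
            candleToFind = new DataPoint { OpenTimeTs = dateToFindMs };
        }

        // The three methods from the question, assumed here to be exposed
        // as statics on a hypothetical SearchMethods class.
        [Benchmark]
        public DataPoint BinaryFindClosest() =>
            SearchMethods.BinaryFindClosest(candles, times, dateToFindMs);

        [Benchmark]
        public DataPoint BinaryFindClosest2() =>
            SearchMethods.BinaryFindClosest2(candles, candleToFind);

        [Benchmark]
        public DataPoint FindClosest() =>
            SearchMethods.FindClosest(candles, candleToFind);
    }

    public class Program
    {
        public static void Main() => BenchmarkRunner.Run<ClosestSearchBenchmarks>();
    }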

It does look like the order in which the methods were executed skewed my initial results. Now it looks like the first method is the fastest (as it should be). The slowest is, of course, my own implementation. I have tuned it a bit, but it is still the slowest method:

        public static DataPoint FindClosest(List<DataPoint> candles, List<long> times, DataPoint toFind)
        {
            long timeToFind = toFind.OpenTimeTs;

            int smallestDistanceIdx = -1;
            long smallestDistance = long.MaxValue;

            var count = candles.Count;
            for (int i = 0; i < count; i++)
            {
                var diff = times[i] - timeToFind;
                var distance = diff < 0 ? -diff : diff;
                if (distance < smallestDistance)
                {
                    smallestDistance = distance;
                    smallestDistanceIdx = i;
                }
                else
                {
                    break;
                }
            }

            return candles[smallestDistanceIdx];
        }

To make a long story short: use a proper benchmarking tool.

Upvotes: 0

Views: 116

Answers (1)

AlessandroParma

Reputation: 141

Please take a look at the IL that Methods 1 and 2 generate. It is likely an invalid test; they should produce almost the same machine code.

First: I don't see where you guarantee the ordering, but let's suppose it is there somehow. Binary search will find the most deeply hidden number in about 21 steps (log2(2,000,000) ≈ 21). This test smells weird.

Second: where is the definition of BinaryFindClosestCandle(candles, times, dateToFindMs)? Why is it receiving both the class instances and the list of longs? Why don't you return the index from the binary search on the long list and use it to index the original list of candles? (If you create the list of longs with Select, the 1:1 relation between the lists is preserved.)

Third: the data you are using is a class, so all elements live on the heap. You are boxing an array of 2 million long numbers in method 2. It is almost a crime. Dereferencing data from the heap will cost much more than the comparison itself. I still think that the lists are not ordered.

Create a swap list to apply the search algorithm on, as you did with times, but convert it to an array with .ToArray() instead and leave it on the stack. I don't think there can be anything better on the market than the default comparer for long value types.
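
For example, a minimal sketch of that suggestion, using Array.BinarySearch and the variable names from the question:

    // Snapshot the ordered timestamps into an array once, up front.
    long[] timesArray = times.ToArray();

    // Each lookup then searches the array with the default long comparer.
    int index = Array.BinarySearch(timesArray, dateToFindMs);
    if (index < 0)
        index = ~index; // bitwise complement of the next larger element, as with List.BinarySearch

    // Clamp index to the array bounds and use it to index the original candles list.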

EDIT FOR SOLUTION HINT: depending on how many insertions you do before each lookup of the minimum value, I would go for the following:

if (insertions/lookups > 300,000)
{
    a. Store the index of the minimum (and the minimum value) in a dedicated field; I would also store an IsUpdated flag that becomes false at the first deletion from the list.
    b. Spawn a parallel thread to refresh that index and the minimum value every now and then (depending on how often you do the lookups) when IsUpdated is false, or do it lazily when you start a lookup while IsUpdated is false.
}
else
{
    Use a dictionary with the long as the key (I suppose that two entities with the same long value are likely to be considered equal). A sketch of this variant follows.
}
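
A minimal sketch of the dictionary variant (assuming, as supposed above, that equal long values mean equal entities; note that a dictionary answers exact-match lookups only):

    // Build once; O(1) lookups by exact timestamp afterwards.
    Dictionary<long, DataPoint> byTime = candles.ToDictionary(c => c.OpenTimeTs);

    if (byTime.TryGetValue(dateToFindMs, out DataPoint match))
    {
        // use match
    }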

Upvotes: 1
