Reputation: 5257
I have a dataset of about 2 million records. I am trying to find the single record whose timestamp is closest to a given time. The list is ordered, and each record is represented by the following class:
public class DataPoint
{
public long OpenTimeTs;
}
I have implemented 3 methods that do the same job and produce the same results, and I have some questions about why one of the approaches performs faster than the others.
Method 1 uses a binary search within a list of longs:
private DataPoint BinaryFindClosest(List<DataPoint> candles, List<long> times, long dateToFindMs)
{
int index = times.BinarySearch(dateToFindMs);
if (index >= 0)
return candles[index];
// If not found, List.BinarySearch returns the complement
// of the index where the element should have been.
index = ~index;
// The date searched for is larger than any in the list.
if (index == times.Count)
return candles[index - 1];
// The date searched is smaller than any in the list.
if (index == 0)
return candles[0];
if (Math.Abs(dateToFindMs - times[index - 1]) < Math.Abs(dateToFindMs - times[index]))
return candles[index - 1];
else
return candles[index];
}
Method 2 is almost the same as method 1, except that it uses a custom object comparer.
private DataPoint BinaryFindClosest2(List<DataPoint> candles, DataPoint toFind)
{
var comparer = Comparer<DataPoint>.Create((x, y) => x.OpenTimeTs > y.OpenTimeTs ? 1 : x.OpenTimeTs < y.OpenTimeTs ? -1 : 0);
int index = candles.BinarySearch(toFind, comparer);
if (index >= 0)
return candles[index];
// If not found, List.BinarySearch returns the complement
// of the index where the element should have been.
index = ~index;
// The date searched for is larger than any in the list.
if (index == candles.Count)
return candles[index - 1];
// The date searched is smaller than any in the list.
if (index == 0)
return candles[0];
if (Math.Abs(toFind.OpenTimeTs - candles[index - 1].OpenTimeTs) < Math.Abs(toFind.OpenTimeTs - candles[index].OpenTimeTs))
return candles[index - 1];
else
return candles[index];
}
Finally, method 3 is the one I had been using before discovering the BinarySearch approach in another Stack Overflow topic.
private DataPoint FindClosest(List<DataPoint> candles, DataPoint toFind)
{
long timeToFind = toFind.OpenTimeTs;
int smallestDistanceIdx = -1;
long smallestDistance = long.MaxValue;
for (int i = 0; i < candles.Count(); i++)
{
var candle = candles[i];
var distance = Math.Abs(candle.OpenTimeTs - timeToFind);
if (distance <= smallestDistance)
{
smallestDistance = distance;
smallestDistanceIdx = i;
}
else
{
break;
}
}
return candles[smallestDistanceIdx];
}
Now here comes the problem. After running some benchmarks, it came to my attention that the second method (which uses the custom comparer) is the fastest of the three.
I would like to know why the approach with the custom comparer performs faster than the approach that binary-searches within the list of longs.
I am using the following code to test the methods:
var candles = AppState.GetLoadSymbolData();
var times = candles.Select(s => s.OpenTimeTs).ToList();
var dateToFindMs = candles[candles.Count / 2].OpenTimeTs;
var candleToFind = new DataPoint() { OpenTimeTs = dateToFindMs };
var numberOfFinds = 100_000;
var sw = Stopwatch.StartNew();
for (int i = 0; i < numberOfFinds; i++)
{
var foundCandle = BinaryFindClosest(candles, times, dateToFindMs);
}
sw.Stop();
var elapsed1 = sw.ElapsedMilliseconds;
sw.Restart();
for (int i = 0; i < numberOfFinds; i++)
{
var foundCandle = BinaryFindClosest2(candles, candleToFind);
}
sw.Stop();
var elapsed2 = sw.ElapsedMilliseconds;
sw.Restart();
for (int i = 0; i < numberOfFinds; i++)
{
var foundCandle = FindClosest(candles, candleToFind);
}
sw.Stop();
var elapsed3 = sw.ElapsedMilliseconds;
Console.WriteLine($"Elapsed 1: {elapsed1} ms");
Console.WriteLine($"Elapsed 2: {elapsed2} ms");
Console.WriteLine($"Elapsed 3: {elapsed3} ms");
In release mode, the results are as follows:
Logically I would assume that it should be faster to compare the list of longs, but this is not the case. I tried profiling the code, but it only points to the slow execution of the BinarySearch method, nothing else. So there must be some internal process that slows things down for longs.
Edit: After following the advice, I implemented a proper benchmark using BenchmarkDotNet, and here are the results:
| Method             | N     | Mean            | Error         | StdDev        | Gen0   | Allocated |
|--------------------|-------|-----------------|---------------|---------------|--------|-----------|
| BinaryFindClosest  | 10000 | 28.31 ns        | 0.409 ns      | 0.362 ns      | -      | -         |
| BinaryFindClosest2 | 10000 | 75.85 ns        | 0.865 ns      | 0.722 ns      | 0.0014 | 24 B      |
| FindClosest        | 10000 | 3,363,223.68 ns | 63,300.072 ns | 52,858.427 ns | -      | 2 B       |
It does look like the order in which the methods were executed skewed my initial results. Now the first method is the fastest (as it should be). The slowest, of course, is my own implementation. I have tuned it a bit, but it is still the slowest method:
public static DataPoint FindClosest(List<DataPoint> candles, List<long> times, DataPoint toFind)
{
long timeToFind = toFind.OpenTimeTs;
int smallestDistanceIdx = -1;
long smallestDistance = long.MaxValue;
var count = candles.Count();
for (int i = 0; i < count; i++)
{
var diff = times[i] - timeToFind;
var distance = diff < 0 ? -diff : diff;
if (distance < smallestDistance)
{
smallestDistance = distance;
smallestDistanceIdx = i;
}
else
{
break;
}
}
return candles[smallestDistanceIdx];
}
To make a long story short: use a proper benchmarking tool.
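For reference, a minimal BenchmarkDotNet harness along these lines might look like the sketch below. The class name, field names, and synthetic-data setup are illustrative assumptions, not the exact code I ran; only the first search method is included, and the other two would get their own [Benchmark] wrappers in the same way.
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]
public class ClosestCandleBenchmarks
{
    // DataPoint is the class shown at the top of the question.
    [Params(10_000)]
    public int N;

    private List<DataPoint> candles;
    private List<long> times;
    private long dateToFindMs;

    [GlobalSetup]
    public void Setup()
    {
        // Ordered synthetic data stands in for AppState.GetLoadSymbolData().
        candles = Enumerable.Range(0, N)
                            .Select(i => new DataPoint { OpenTimeTs = i * 60_000L })
                            .ToList();
        times = candles.Select(c => c.OpenTimeTs).ToList();
        dateToFindMs = candles[N / 2].OpenTimeTs;
    }

    [Benchmark]
    public DataPoint BinaryFindClosestBench() => BinaryFindClosest(candles, times, dateToFindMs);

    // Same method as in the question, made static so it can live inside the benchmark class.
    private static DataPoint BinaryFindClosest(List<DataPoint> candles, List<long> times, long dateToFindMs)
    {
        int index = times.BinarySearch(dateToFindMs);
        if (index >= 0)
            return candles[index];
        index = ~index;
        if (index == times.Count)
            return candles[index - 1];
        if (index == 0)
            return candles[0];
        return dateToFindMs - times[index - 1] < times[index] - dateToFindMs
            ? candles[index - 1]
            : candles[index];
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<ClosestCandleBenchmarks>();
}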
Upvotes: 0
Views: 116
Reputation: 141
Please take a look at the IL that methods 1 and 2 generate. It is likely an invalid test; they should compile to almost the same machine code.
First: I don't see where you guarantee the ordering, but let's suppose it is there somehow. A binary search will find even the most deeply buried number in roughly 20 to 25 steps (log2(2,000,000) is about 21). This test smells weird.
Second: where is the definition of BinaryFindClosestCandle(candles, times, dateToFindMs)? Why is it receiving both the class instances and the list of longs? Why don't you return the index from the binary search on the long list and use it to index the original list of candles? (If you create the list of longs with Select, the 1:1 relation between the lists is preserved.)
Third: the data you are using is a class, so all elements live on the heap. You are boxing 2 million long values behind object references in method 2; it is almost a crime. Dereferencing data from the heap will cost much more than the comparison itself. I still think that the lists are not ordered.
Create a working copy to apply the search algorithm on, as you did with times, but convert it to an array with .ToArray() instead, so you get a flat block of value types. I don't think there is anything better on the market than the default comparer for long value types.
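A minimal sketch of that suggestion, reusing candles and dateToFindMs from the question (variable names here are illustrative): do the binary search on a plain long[] built once with ToArray(), then use the resulting index to address the original list of candles.
// Built once, outside the hot loop; Array.BinarySearch on a long[] uses the
// default value-type comparer, with no comparer object involved.
long[] timesArray = candles.Select(c => c.OpenTimeTs).ToArray();

int index = Array.BinarySearch(timesArray, dateToFindMs);
if (index < 0)
    index = ~index; // insertion point when the exact timestamp is not present

// Clamp to the valid range and pick the closer neighbour.
if (index == timesArray.Length)
    index = timesArray.Length - 1;
else if (index > 0 && dateToFindMs - timesArray[index - 1] < timesArray[index] - dateToFindMs)
    index--;

DataPoint closest = candles[index];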
EDIT FOR SOLUTION HINT: Depending on how many insertions you do before each lookup for the minimum value, I would go for the following:
if (insertions/lookups > 300,000)
{
a. Store the index of the minimum (and the minimum value) in a dedicated field. I would also store an IsUpdated flag that is set to false at the first deletion from the list.
b. Spawn a parallel thread to refresh that index and the minimum value every now and then (depending on how often you do the lookups) when IsUpdated is false, or refresh lazily when you start a lookup with IsUpdated = false.
}
else
{
use a dictionary with the long as the key (I suppose that two entities with the same long value are likely to be considered equal).
}
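For the dictionary branch, a minimal sketch (assuming OpenTimeTs values are unique, as the hint supposes; ToDictionary throws on duplicate keys):
// Built once; afterwards an exact-timestamp lookup is O(1).
Dictionary<long, DataPoint> byTime = candles.ToDictionary(c => c.OpenTimeTs);

if (byTime.TryGetValue(dateToFindMs, out DataPoint exact))
{
    // Exact hit; for a nearest-match fallback, one of the binary searches above still applies.
}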
Upvotes: 1