Reputation: 806
I have two large lists of objects. The first (about 1,000,000 objects):
public class BaseItem
{
    public BaseItem()
    {
    }

    public double Fee { get; set; } = 0;
    public string Market { get; set; } = string.Empty;
    public string Traider { get; set; } = string.Empty;
    public DateTime DateUtc { get; set; } = new DateTime();
}
The second (about 20,000 objects):
public class TraiderItem
{
    public TraiderItem()
    {
    }

    public DateTime DateUtc { get; set; } = new DateTime();
    public string Market { get; set; } = string.Empty;
    public string Type { get; set; } = string.Empty;
    public double Price { get; set; } = 0;
    public double Amount { get; set; } = 0;
    public double Total { get; set; } = 0;
    public double Fee { get; set; } = 0;
    public string FeeCoin { get; set; } = string.Empty;
}
I need to find all Traider items that are present in the Base items, where DateUtc and Fee are equal. Right now I am using the Any method:
traiderItemsInBase = traiderItems
    .Where(a => baseItems.Any(x => x.DateUtc == a.DateUtc &&
                                   Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();
But this way is very, very slow. Is there a way to make it more efficient? Is it possible to use a HashSet in this case?
Upvotes: 6
Views: 1794
Reputation: 335
I have tried some of the suggestions, and this is the fastest I could get so far:
private static void TestForPreCountingParallel(List<TraiderItem> traiderItems, List<BaseItem> baseItems)
{
    var watch = new Stopwatch();
    watch.Start();
    ConcurrentBag<TraiderItem> traiderItemsInBase = null;
    for (int i = 0; i < 3; i++)
    {
        traiderItemsInBase = new ConcurrentBag<TraiderItem>();
        // Precompute the rounded base fees once, outside the parallel loop.
        var baseFeesRounds = baseItems.Select(bi => Math.Round((double)bi.Fee * 0.4, 8)).ToArray();
        Parallel.ForEach(traiderItems, traiderItem =>
        {
            double traiderFeeRound = Math.Round(traiderItem.Fee, 8);
            for (var index = 0; index < baseItems.Count; index++)
            {
                var baseItem = baseItems[index];
                if (traiderItem.DateUtc == baseItem.DateUtc && traiderFeeRound == baseFeesRounds[index])
                {
                    traiderItemsInBase.Add(traiderItem);
                    break; // first match is enough, mirroring Any()
                }
            }
        });
        Console.WriteLine(i + "," + watch.ElapsedMilliseconds);
    }
    watch.Stop();
    Console.WriteLine("base:{0},traid:{1},res:{2},time:{3}", baseItems.Count, traiderItems.Count,
        traiderItemsInBase.Count, watch.ElapsedMilliseconds);
}
Does anyone have another improvement?
Here is how the things I have tried compare. The absolute times are not significant; the trend is what to look at. The benchmark is not perfect, and I haven't played much with the ratio of TraiderItems present in BaseItems; my own ratio is pretty low, as you can see: 40 out of 100,000.
I am not an expert, so I have to refer to other sources, like http://mattwarren.org/2016/09/29/Optimising-LINQ/:
What’s the problem with LINQ?
As outlined by Joe Duffy, LINQ introduces inefficiencies in the form of hidden allocations
So that is the conclusion from the article. But I like LINQ a lot, and I use it frequently.
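To make the hidden-allocation point concrete: in the original query, the lambda passed to Any captures the outer variable a, so a closure object is allocated for every TraiderItem. A plain nested loop avoids that. This is a minimal sketch of the idea, not the benchmarked code above:

// Allocation-free equivalent of traiderItems.Where(a => baseItems.Any(...)).
// No closures and no hidden enumerator allocations, just nested loops.
var result = new List<TraiderItem>();
foreach (var t in traiderItems)
{
    double tFeeRound = Math.Round((double)t.Fee * 0.4, 8); // rounded once per outer item
    foreach (var b in baseItems)
    {
        if (b.DateUtc == t.DateUtc && Math.Round(b.Fee, 8) == tFeeRound)
        {
            result.Add(t);
            break; // same semantics as Any(): stop at the first match
        }
    }
}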
Upvotes: 1
Reputation: 33
IMHO the main delay is Math.Round, and it can be reduced in two ways:

1. For x.Fee: make a facade object for BaseItem and store the rounded fee, calculated once, as FeeRound in it (or add a FeeRound property to the item itself). As written, that Math.Round is called 1M × 20K times, and rounding is probably not a cheap operation for the compiler/CPU pair.

2. Convert the first lambda into a statement body, calculate the rounded a.Fee in it once, and use that value inside baseItems.Any(...), like this:
traiderItems
    .Where(a =>
    {
        var aFeeRound = Math.Round((double)a.Fee * 0.4, 8);
        return baseItems.Any(x => x.DateUtc == a.DateUtc &&
                                  x.FeeRound == aFeeRound);
    })
    .ToList();
This way Math.Round runs only once per item instead of once per comparison. Sorry if there are mistakes; I had no time to test. And sure, TPL is a good idea too. Good luck!
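A minimal sketch of suggestion 1, assuming a hypothetical BaseItemFacade wrapper (the name and shape are mine, not from the answer):

// Hypothetical facade: wraps a BaseItem and caches its rounded fee once.
public class BaseItemFacade
{
    public BaseItemFacade(BaseItem item)
    {
        Item = item;
        FeeRound = Math.Round(item.Fee, 8); // rounded exactly once per BaseItem
    }

    public BaseItem Item { get; }
    public double FeeRound { get; }
    public DateTime DateUtc => Item.DateUtc;
}

// Build the facades once, then query against the cached values.
var baseFacades = baseItems.Select(b => new BaseItemFacade(b)).ToList();
var traiderItemsInBase = traiderItems
    .Where(a =>
    {
        var aFeeRound = Math.Round((double)a.Fee * 0.4, 8);
        return baseFacades.Any(x => x.DateUtc == a.DateUtc && x.FeeRound == aFeeRound);
    })
    .ToList();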
Upvotes: 1
Reputation: 45957
First I thought of a solution with HashSet<> or Dictionary<>, but that doesn't really fit this use case. How about speeding it up by using more of your cores/threads with PLINQ's AsParallel()?
traiderItemsInBase = traiderItems.AsParallel()
.Where(a => baseItems.Any(x => x.DateUtc == a.DateUtc &&
Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
.ToList();
This should scale pretty well, since these operations run in memory and are not bound by a database query or some other bottleneck, so 4 cores should solve this almost 4 times faster.
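If you want to control how many workers PLINQ uses rather than leaving it to the scheduler, the query also accepts WithDegreeOfParallelism; a small variation of the snippet above:

traiderItemsInBase = traiderItems.AsParallel()
    .WithDegreeOfParallelism(Environment.ProcessorCount) // cap the number of worker threads
    .Where(a => baseItems.Any(x => x.DateUtc == a.DateUtc &&
                                   Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();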
Upvotes: 7
Reputation: 7610
Since there are few BaseItems per date, you can group them by date in a dictionary; each lookup then only scans the handful of items that share a date:
var baseItemsDic = new Dictionary<DateTime, List<BaseItem>>();
foreach (var item in baseItems)
{
    if (!baseItemsDic.ContainsKey(item.DateUtc))
        baseItemsDic.Add(item.DateUtc, new List<BaseItem>());
    baseItemsDic[item.DateUtc].Add(item);
}
var traiderItemsInBase = traiderItems
    .Where(a => baseItemsDic.ContainsKey(a.DateUtc) &&
                baseItemsDic[a.DateUtc].Any(x => Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();
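The same grouping can also be expressed with LINQ's ToLookup, which returns an empty sequence for missing keys, so the ContainsKey check disappears. A sketch as an alternative to the dictionary above, not the answerer's code:

// ToLookup groups the base items by date in one pass; indexing a missing
// key yields an empty sequence, so no explicit existence check is needed.
var baseByDate = baseItems.ToLookup(b => b.DateUtc);

var traiderItemsInBase = traiderItems
    .Where(a => baseByDate[a.DateUtc]
        .Any(x => Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();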
Upvotes: 1
Reputation: 1937
Using LINQ that way, i.e. Any inside Where, is almost O(N^2) (more precisely O(N·K) for the two collection sizes).
A better approach is to first create a HashSet whose key looks like:
DateUtc.ToString("<format based on matching depth, e.g. date only, or down to minutes/seconds>") + "_" + <rounded fee>.ToString()
and fill it with all the BaseItem objects (worst case you will have about 1 million entries in the HashSet). This is equivalent to one for loop.
Next, loop through all items in the TraiderItem collection (the smaller collection), form the lookup key the same way, and check the HashSet. This is another for loop.
Net time complexity is about O(N) + O(K); you can improve this further by building the HashSet in advance or in parallel. See the sketch below.
Space complexity is higher, but you have plenty of RAM these days :)
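A minimal sketch of this approach, assuming exact DateTime matching via Ticks and the fee predicate from the question (the exact key format is my own choice):

// Build the set of composite keys from the large collection: one pass, O(N).
// Ticks gives exact DateTime matching; use a coarser format string instead
// if you only need date-only or minute-level matching.
var baseKeys = new HashSet<string>(
    baseItems.Select(b => b.DateUtc.Ticks + "_" + Math.Round(b.Fee, 8).ToString("R")));

// Probe with the smaller collection: one pass, O(K), each lookup O(1).
var traiderItemsInBase = traiderItems
    .Where(t => baseKeys.Contains(
        t.DateUtc.Ticks + "_" + Math.Round((double)t.Fee * 0.4, 8).ToString("R")))
    .ToList();

A HashSet<(long, double)> of value tuples would avoid the per-item string allocations entirely; the string key above just mirrors the description.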
Upvotes: 0
Reputation: 181
You could precalculate the rounded fee on both collections. Maybe also group the items by date if dates repeat a lot in the larger collection; a sketch follows below.
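A minimal sketch of this suggestion, using a LINQ join keyed on (DateUtc, rounded fee) so that each fee is rounded exactly once; the join-based shape is my own reading of the idea:

// Round each fee once up front, then let Join build its internal hash table
// keyed on (DateUtc, rounded fee): roughly O(N + K) instead of O(N * K).
var traiderItemsInBase = traiderItems
    .Select(t => new { Item = t, Key = new { t.DateUtc, Fee = Math.Round((double)t.Fee * 0.4, 8) } })
    .Join(baseItems.Select(b => new { b.DateUtc, Fee = Math.Round(b.Fee, 8) }),
          t => t.Key,
          b => b,
          (t, b) => t.Item)
    .Distinct() // a trader item may match several base items; keep it once
    .ToList();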
Upvotes: 1