Konstantin

Reputation: 806

Filter large list object on data from another large list: slow performance

I have two large lists of objects. The first (about 1 000 000 objects):

public class BaseItem
{
    public BaseItem()
    {

    }

    public double Fee { get; set; } = 0;

    public string Market { get; set; } = string.Empty;

    public string Traider { get; set; } = string.Empty;

    public DateTime DateUtc { get; set; } = new DateTime();
}

The second (about 20 000 objects):

public class TraiderItem
{
    public TraiderItem()
    {

    }

    public DateTime DateUtc { get; set; } = new DateTime();

    public string Market { get; set; } = string.Empty;

    public string Type { get; set; } = string.Empty;

    public double Price { get; set; } = 0;

    public double Amount { get; set; } = 0;

    public double Total { get; set; } = 0;

    public double Fee { get; set; } = 0;

    public string FeeCoin { get; set; } = string.Empty;
}

I need to find all Traider items that have a matching Base item, where the DateUtc values are equal and the fees are equal. Right now I am using the Any method:

traiderItemsInBase = traiderItems
    .Where(a => baseItems.Any(x => x.DateUtc == a.DateUtc &&
                                   Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();

But this approach is very, very slow. Is there a way to make it more efficient? Is it possible to use a HashSet in this case?

Upvotes: 6

Views: 1794

Answers (6)

Bobo
Bobo

Reputation: 335

I have tried some suggestions and this is so far the fastest I could get:

private static void TestForPreCountingParallel(List<TraiderItem> traiderItems, List<BaseItem> baseItems)
{
    var watch = new Stopwatch();
    watch.Start();
    ConcurrentBag<TraiderItem> traiderItemsInBase = null;
    for (int i = 0; i < 3; i++)
    {
        traiderItemsInBase = new ConcurrentBag<TraiderItem>();
        var baseFeesRounds = baseItems.Select(bi => Math.Round((double)bi.Fee * 0.4, 8)).ToArray();
        Parallel.ForEach(traiderItems, traiderItem =>
        {
            double traiderFeeRound = Math.Round(traiderItem.Fee, 8);
            for (var index = 0; index < baseItems.Count; index++)
            {
                var baseItem = baseItems[index];
                if (traiderItem.DateUtc == baseItem.DateUtc && traiderFeeRound == baseFeesRounds[index])
                {
                    traiderItemsInBase.Add(traiderItem);
                    break;
                }
            }
        });

        Console.WriteLine(i + "," + watch.ElapsedMilliseconds);
    }

    watch.Stop();
    Console.WriteLine("base:{0},traid:{1},res:{2},time:{3}", baseItems.Count, traiderItems.Count,
        traiderItemsInBase.Count, watch.ElapsedMilliseconds);
}

Does anyone have another improvement?

Here is what I have tried:

  1. Original Linq: base:100000,traid:20000,res:40,time:102544
  2. Converted to foreach loops: base:100000,traid:20000,res:40,time:43890
  3. Precounting fees: base:100000,traid:20000,res:40,time:22661
  4. Parallel outer loop: base:100000,traid:20000,res:40,time:6823

The absolute times are not significant; the trend is what to look at. The benchmark is not perfect, and I haven't played much with the ratio of TraiderItems found in BaseItems; mine is pretty low, as you can see: 40 out of 100 000.

So just to see some different ratios:

  1. base:100000,traid:20000,res:400,time:102417
  2. base:100000,traid:20000,res:400,time:50842
  3. base:100000,traid:20000,res:400,time:21754
  4. base:100000,traid:20000,res:400,time:8296

And another:

  1. base:100000,traid:20000,res:2000,time:118150
  2. base:100000,traid:20000,res:2000,time:57832
  3. base:100000,traid:20000,res:2000,time:21659
  4. base:100000,traid:20000,res:2000,time:7350

I am not an expert, so I have to refer to other sources, like http://mattwarren.org/2016/09/29/Optimising-LINQ/:

What’s the problem with LINQ?

As outlined by Joe Duffy, LINQ introduces inefficiencies in the form of hidden allocations

So the conclusion is:

  1. Do your own benchmark and try some code changes first if you really care about performance. Just throwing brute force at inefficient code is going to cost someone money.
  2. Do not use complex LINQ queries on large collections unless you test the performance. I have been burned by that; the threshold is surprisingly low (around 10k items, the wrong LINQ can kill your processing time where a simple loop works fine).

But I like LINQ a lot and use it frequently.

Upvotes: 1

AndrewF
AndrewF

Reputation: 33

IMHO the main cost is Math.Round, and it can be reduced in two ways:

  1. For x.Fee: make a facade object for BaseItem (or add a FeeRound property to BaseItem itself) and store the rounded fee once. As written, that Math.Round is called 1M × 20K times, and rounding is not an operation the compiler/CPU pair handles cheaply.
  2. Convert the first lambda into a statement body, compute a's rounded fee once, and pass it into baseItems.Any(...) like this:

traiderItems.Where(a =>
    {
        var aFeeRound = Math.Round((double)a.Fee * 0.4, 8);
        return baseItems.Any(x =>
            x.DateUtc == a.DateUtc &&
            x.FeeRound == aFeeRound);
    })
    .ToList();

This way Math.Round runs only once per item on each side. Sorry if there are mistakes; I had no time to test. And yes, the TPL is a good idea too. Good luck!
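
For reference, a minimal sketch of point 1, assuming a hypothetical FeeRound property added to BaseItem (it is not in the original class):

// Hypothetical addition to BaseItem (or to a facade wrapping it):
// the fee is rounded once per item, not once per comparison.
public double FeeRound { get; set; }

// Fill it once before the query, a single O(N) pass:
foreach (var baseItem in baseItems)
    baseItem.FeeRound = Math.Round(baseItem.Fee, 8);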

Upvotes: 1

fubo
fubo

Reputation: 45957

First I thought of a solution with HashSet<> or Dictionary<>, but that doesn't really fit this use case. How about speeding it up by using more of your cores/threads with PLINQ's AsParallel()?

traiderItemsInBase = traiderItems.AsParallel()
    .Where(a => baseItems.Any(x => x.DateUtc == a.DateUtc &&
                              Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();

This should scale pretty well, since these operations run entirely in memory and aren't bottlenecked by a database or other I/O. So 4 cores should solve this almost 4x faster.

Upvotes: 7

vernou
vernou

Reputation: 7610

Since there are few BaseItems per date, you can group them by date in a dictionary:

var baseItemsDic = new Dictionary<DateTime, List<BaseItem>>();
foreach (var item in baseItems)
{
    if (!baseItemsDic.ContainsKey(item.DateUtc))
        baseItemsDic.Add(item.DateUtc, new List<BaseItem>());
    baseItemsDic[item.DateUtc].Add(item);
}


var traiderItemsInBase = traiderItems
    .Where(a => baseItemsDic.ContainsKey(a.DateUtc) &&
                baseItemsDic[a.DateUtc].Any(x => Math.Round(x.Fee, 8) == Math.Round((double)a.Fee * 0.4, 8)))
    .ToList();

Upvotes: 1

Prateek Shrivastava
Prateek Shrivastava

Reputation: 1937

Using Any inside Where like that is effectively an O(N × K) scan over both collections.

A better approach is to first build a HashSet<string> whose keys look like:

DateUtc.ToString("<format chosen for the matching precision, e.g. date only or down to minutes/seconds>") + "_" + roundedFee.ToString()

and fill it from the BaseItem list (worst case, about 1 million entries in the HashSet). This is one for loop.

Next, loop through the TraiderItem collection (the smaller one), build the same lookup key, and check the HashSet. That is the second for loop.

Net time complexity is about O(N) + O(K). You can improve this further by building the HashSet in advance or in parallel.

Space complexity is higher, but you have plenty of RAM these days. :)
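
A minimal sketch of that idea (the string key and the 0.4 multiplier follow the question; the "o" timestamp format and "R" fee format are my assumptions about the matching precision):

// Build the lookup set once from the large collection: one O(N) pass.
// Key format is an assumption: full round-trip timestamp + rounded fee.
var baseKeys = new HashSet<string>(
    baseItems.Select(b =>
        b.DateUtc.ToString("o") + "_" + Math.Round(b.Fee, 8).ToString("R")));

// Probe with the smaller collection: one O(K) pass.
var traiderItemsInBase = traiderItems
    .Where(t => baseKeys.Contains(
        t.DateUtc.ToString("o") + "_" + Math.Round(t.Fee * 0.4, 8).ToString("R")))
    .ToList();

A HashSet<(DateTime, double)> with a tuple key would avoid the string allocations entirely; the string key just matches the description above.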

Upvotes: 0

AccessViolation

Reputation: 181

You could precalculate the rounded fee for both collections. You could also group the items by date if dates repeat a lot in the largest collection.
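
A minimal sketch combining both suggestions, reusing the 0.4 multiplier and 8-digit rounding from the question (variable names are mine):

// Group the large collection by date, keeping only the precalculated
// rounded fees; each traider item then probes a single small bucket.
var baseFeesByDate = baseItems
    .GroupBy(b => b.DateUtc)
    .ToDictionary(g => g.Key,
                  g => new HashSet<double>(g.Select(b => Math.Round(b.Fee, 8))));

var traiderItemsInBase = traiderItems
    .Where(t =>
    {
        var fee = Math.Round(t.Fee * 0.4, 8);
        return baseFeesByDate.TryGetValue(t.DateUtc, out var fees)
               && fees.Contains(fee);
    })
    .ToList();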

Upvotes: 1
