Harsh

Reputation: 23

How to improve performance while processing data from a huge list in C#?

I have a class Test1 which calls a method on class Test2.

public class Test1
{
    public void Testmethod1(List<InputData> request)
    {
        // get data from SQL: huge list, more than 150K inputs
        var inputs = new List<InputData>();
        var output = Test2.Testmethod2(inputs);
    }
}

The Test2 class has the processing method below:

public class Test2
{
    // request list count: 200K
    public static List<OutputData> Testmethod2(List<InputData> request)
    {
        var sync = new object();
        var output = new List<OutputData>();
        var output1 = new List<OutputData>();

        // distinct data count: 20K
        var data = request.Select(x => x.Input2).Distinct().ToList();

        // approach 1, plain foreach: processing time 4 hours
        foreach (var n in data)
        {
            output.AddRange(ProcessData(request.Where(x => x.Input1 == n)));
        }

        // approach 2, Parallel.ForEach: processing time also 4 hours
        var options = new ParallelOptions { MaxDegreeOfParallelism = 3 };
        Parallel.ForEach(data, options, n =>
        {
            // the lock covers the whole body, so the work is effectively serialized
            lock (sync)
            {
                output1.AddRange(ProcessData(request.Where(x => x.Input1 == n)));
            }
        });

        return output;
    }

    public static List<OutputData> ProcessData(IEnumerable<InputData> inputData)
    {
        var result = new List<OutputData>();
        // processing on the input data
        return result;
    }
}


public class InputData
{
    public int Input1 { get; set; }
    public int Input2 { get; set; }
    public int Input3 { get; set; }
    public DateTime Input4 { get; set; }
    public int Input5 { get; set; }
    public int Input6 { get; set; }
    public string Input7 { get; set; }
    public int Input8 { get; set; }
    public int Input9 { get; set; }
}

public class OutputData
{
    public int Output1 { get; set; }
    public int Output2 { get; set; }
    public int Output3 { get; set; }
    public int Output4 { get; set; }
}

It's taking quite a long time to process the data, around 4 hours, and Parallel.ForEach performs no better than the plain loop. I am considering a Dictionary to store the input data; however, the data is not unique and has no unique key.
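
To illustrate the constraint (the key value 42 below is hypothetical): a Dictionary requires unique keys, while a Lookup allows several rows per key.

// ToDictionary throws ArgumentException when Input1 repeats:
// var map = request.ToDictionary(x => x.Input1);

// ToLookup accepts duplicate keys and groups the rows per key:
var lookup = request.ToLookup(x => x.Input1);
IEnumerable<InputData> rowsForKey = lookup[42]; // 42 is a hypothetical key value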

Is there a better approach that would let us optimize this?

Thanks!

Upvotes: 0

Views: 868

Answers (2)

Sarin

Reputation: 1285

var output =
    (from n in request
     // group items in request by distinct values of Input2
     group n by n.Input2)
    .AsParallel()
    .WithDegreeOfParallelism(4)
    .Select(data => Test2.ProcessData(
        // filter the inputs for this group
        data.Where(x => x.Input1 == data.Key)
    ))
    .Cast<IEnumerable<OutputData>>()
    // combine the per-group outputs
    .Aggregate(Enumerable.Concat)
    // generate the final list
    .ToList();

The idea is to group the request by InputData.Input2 values, process the groups in parallel, and collect all the results.
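
A rough method-syntax equivalent (a sketch along the same lines, not tested against your data): SelectMany flattens the per-group results, which avoids the Cast/Aggregate step.

var output = request
    .GroupBy(x => x.Input2)
    .AsParallel()
    .WithDegreeOfParallelism(4)
    // process each group and flatten the per-group lists into one sequence
    .SelectMany(g => Test2.ProcessData(g.Where(x => x.Input1 == g.Key)))
    .ToList();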

Conceptually, this is a variation of @Panagiotis Kanavos's answer.

Upvotes: 0

Panagiotis Kanavos

Reputation: 131180

Right now, the code is using brute force: it performs a full scan of the 200K-item request list for each of the 20K distinct keys. That's roughly 4 billion iterations.

I suspect performance will improve dramatically just by creating a dictionary, or a lookup if there are multiple items per key, e.g.:

var myIndex = request.ToLookup(x => x.Input1);
var output = new List<OutputData>(20000);
foreach (var n in data)
{
    output.AddRange(ProcessData(myIndex[n]));
}

I specify a capacity in the list constructor to reduce reallocations each time the list's internal buffer gets full. There's no need for a precise value.

If the code is still slow, one approach would be to use Parallel.ForEach or PLINQ, e.g.:

var output = ( from n in data.AsParallel().WithDegreeOfParallelism(3)
               let dt = myIndex[n]
               select ProcessData(dt)
             )
             // flatten the per-key lists into a single result list
             .SelectMany(list => list)
             .ToList();
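
Note that PLINQ doesn't guarantee the order of the results; if output order matters, AsOrdered() can be added after AsParallel().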

Upvotes: 1
