AshishJindal
AshishJindal

Reputation: 176

How to optimise linq manipulation to get rows

I have list of serial numbers and want to extract the rows from a datatable corresponding to these numbers and against an ID. I am using the below LINQ query for same:

//list of serial numbers
var serialNumAlreadyExisted = [1,2,3];
var varID = 2;

//get the corresponding rows for these serial numbers
var duplicateRows = (from row in dt.AsEnumerable()
    where row.Field<int>("ID") == varID && 
    serialNumAlreadyExisted.Any(sr => sr == row.Field<string>("SERIAL_NUMBER"))
    select row).ToList();

The above code works well for 1-2K rows, but takes a lot of time if there are 50K serial numbers and 50K records in the datatable.

Is there any way to optimize it and reduce the processing time ?

Upvotes: 0

Views: 131

Answers (4)

devuxer
devuxer

Reputation: 42364

I'm not sure if you intended your serial numbers to be a list of int or string, but taking what you have literally, it probably helps performance to do a one-time conversion to string, like so:

var serialStringsAlreadyExisted = serialNumAlreadyExisted.Select(x => x.ToString()).ToList();

Then you can proceed with a join, which should be more efficient than Contains or Any:

var duplicateRows = (
    from row in dt.AsEnumerable()
    where row.Field<int>("ID") == varID
    join serial in serialStringsAlreadyExisted on row.Field<string>("SERIAL_NUMBER") equals serial
    select row)
    .ToList();

Edit

Just performed a quick speed test. Using join completes in about half the time compared your original code for one million rows.

And if you increase the number of items in serialNumAlreadyExisted to 20, using join takes closer to 20% of the time of the baseline method.

Upvotes: 2

amiry jd
amiry jd

Reputation: 27585

First of all, you are linqing against data-table, which means your records all are in memory. So, if you move your logic to a stored procedure and join the serialNumAlreadyExisted with target table, you will have an excellent improvement in performance. But, while you're using datatabale tag, seems it's not an option.

Second, serialNumAlreadyExisted is an int-array, so how you compare it via string?

But, in your short given snippet, there is not much options. Using Any or Contains or joining in-memory data, makes not much different. Using a HashSet for serialNumAlreadyExisted can help a little. But, saying again, move the join to a stored procedure, and you will see the difference.

Upvotes: 1

Hari Prasad
Hari Prasad

Reputation: 16956

I would suggest combination of HashSet and Contains call which gives O(1) access. Any in this case goes through the collection. In worst case scenario it require 3 comparisons.

HashSet<int> serialNumAlreadyExisted = new HashSet<int>();

serialNumAlreadyExisted.Add(1);
serialNumAlreadyExisted.Add(2);
serialNumAlreadyExisted.Add(3);

var duplicateRows = 
    (from row in dt.AsEnumerable()
    where row.Field<int>("ID") == varID && 
          serialNumAlreadyExisted.Contains(row.Field<string>("SERIAL_NUMBER"))
   select row).ToList();

Upvotes: 1

Mrinal Kamboj
Mrinal Kamboj

Reputation: 11478

As the number of rows increase, you find it slow in performance due to a linear search currently being done as it tries to find a number using the following code:

serialNumAlreadyExisted.Any(sr => sr == row.Field<string>("SERIAL_NUMBER"))

Currnt complexity is O(n^2), to make it O(n), is it possible for you to create a Dictionary out of var serialNumAlreadyExisted = [1,2,3];

Assuming following is the code, you can make it better, especially the dictionary value

var testDictionary = serialNumAlreadyExisted.ToDictionary(x=>x,x=>x);

Final code will become as follows:

var duplicateRows = 
    (from row in dt.AsEnumerable()
    where row.Field<int>("ID") == varID && 
          testDictionary.ConatinsKey(row.Field<string>("SERIAL_NUMBER"))
   select row).ToList();

Upvotes: 1

Related Questions