Reputation: 176
I have list of serial numbers and want to extract the rows from a datatable corresponding to these numbers and against an ID. I am using the below LINQ query for same:
//list of serial numbers
var serialNumAlreadyExisted = [1,2,3];
var varID = 2;
//get the corresponding rows for these serial numbers
var duplicateRows = (from row in dt.AsEnumerable()
where row.Field<int>("ID") == varID &&
serialNumAlreadyExisted.Any(sr => sr == row.Field<string>("SERIAL_NUMBER"))
select row).ToList();
The above code works well for 1-2K rows, but takes a lot of time if there are 50K serial numbers and 50K records in the datatable.
Is there any way to optimize it and reduce the processing time ?
Upvotes: 0
Views: 131
Reputation: 42364
I'm not sure if you intended your serial numbers to be a list of int
or string
, but taking what you have literally, it probably helps performance to do a one-time conversion to string
, like so:
var serialStringsAlreadyExisted = serialNumAlreadyExisted.Select(x => x.ToString()).ToList();
Then you can proceed with a join, which should be more efficient than Contains
or Any
:
var duplicateRows = (
from row in dt.AsEnumerable()
where row.Field<int>("ID") == varID
join serial in serialStringsAlreadyExisted on row.Field<string>("SERIAL_NUMBER") equals serial
select row)
.ToList();
Edit
Just performed a quick speed test. Using join
completes in about half the time compared your original code for one million rows.
And if you increase the number of items in serialNumAlreadyExisted
to 20, using join
takes closer to 20% of the time of the baseline method.
Upvotes: 2
Reputation: 27585
First of all, you are linqing against data-table, which means your records all are in memory. So, if you move your logic to a stored procedure and join the serialNumAlreadyExisted
with target table, you will have an excellent improvement in performance. But, while you're using datatabale
tag, seems it's not an option.
Second, serialNumAlreadyExisted
is an int-array, so how you compare it via string
?
But, in your short given snippet, there is not much options. Using Any
or Contains
or joining in-memory data, makes not much different. Using a HashSet
for serialNumAlreadyExisted
can help a little. But, saying again, move the join to a stored procedure, and you will see the difference.
Upvotes: 1
Reputation: 16956
I would suggest combination of HashSet
and Contains
call which gives O(1)
access. Any
in this case goes through the collection. In worst case scenario it require 3 comparisons.
HashSet<int> serialNumAlreadyExisted = new HashSet<int>();
serialNumAlreadyExisted.Add(1);
serialNumAlreadyExisted.Add(2);
serialNumAlreadyExisted.Add(3);
var duplicateRows =
(from row in dt.AsEnumerable()
where row.Field<int>("ID") == varID &&
serialNumAlreadyExisted.Contains(row.Field<string>("SERIAL_NUMBER"))
select row).ToList();
Upvotes: 1
Reputation: 11478
As the number of rows increase, you find it slow in performance due to a linear search currently being done as it tries to find a number using the following code:
serialNumAlreadyExisted.Any(sr => sr == row.Field<string>("SERIAL_NUMBER"))
Currnt complexity is O(n^2), to make it O(n)
, is it possible for you to create a Dictionary out of var serialNumAlreadyExisted = [1,2,3];
Assuming following is the code, you can make it better, especially the dictionary value
var testDictionary = serialNumAlreadyExisted.ToDictionary(x=>x,x=>x);
Final code will become as follows:
var duplicateRows =
(from row in dt.AsEnumerable()
where row.Field<int>("ID") == varID &&
testDictionary.ConatinsKey(row.Field<string>("SERIAL_NUMBER"))
select row).ToList();
Upvotes: 1