Reputation: 2956
I need to write a query pulling distinct values from columns defined by a user for any given data set. There could be millions of rows so the statements must be as efficient as possible. Below is the code I have.
What is the order of this LINQ query? Is there a more efficient way of doing this?
var MyValues = from r in MyDataTable.AsEnumerable()
orderby r.Field<double>(_varName)
select r.Field<double>(_varName);
IEnumerable result= MyValues.Distinct();
Upvotes: 3
Views: 2620
Reputation: 460288
from comments
I actually use MyDistinct.Distinct()
If you want distinct _varName
values and you cannot do this all in the select query in dbms(what would be the most efficient way), you should use Distinct
before OrderBy
. The order matters here.
You would need to order all million of rows before you start to filter out the duplicates. If you use distinct first, you need to order only the rest.
var values = from r in MyDataTable.AsEnumerable()
select r.Field<double>(_varName);
IEnumerable<double> orderedDistinctValues = values.Distinct()
.OrderBy(d => d);
I have asked a related question recently which E.Lippert answered with a good explanation when order matters and when not:
Order of LINQ extension methods does not affect performance?
Here's a little demo where you can see that the order matters, but you can also see that it does not really matter since comparing doubles is trivial for a cpu:
Time for first orderby then distinct: 00:00:00.0045379
Time for first distinct then orderby: 00:00:00.0013316
Upvotes: 1
Reputation: 107606
I can't speak much to the AsEnumerable()
call or the field conversions, but for the LINQ side of things, the orderby
is a stable quick sort and should be O(n log n)
. If I had to guess, everything but the orderby
should be O(n)
, so overall you're still just O(n log n)
.
Update: the LINQ Distinct()
call should also be O(n)
.
So altogether, the Big-Oh for this thing is still O(Kn log n)
, where K is some constant.
Upvotes: 6
Reputation: 2481
your above query (linq) is good if you want all the million records and you have enough memory on a 64bit memory addressing OS.
the order of the query is, if you see the underlying command, would be transalated to
Select <_varname> from MyDataTable order by <_varname>
and this is as good as it is when run on the database IDE or commandline.
to give you a short answer regarding performance
it may be faster today, but with the tablespace growing, and if your data is not ordered (indexed) thats where things get slowerr (even if you had a good linq expression)
Hope this helps
that said, if your db is not properly indexed, meaning
Upvotes: 0
Reputation: 564851
Is there a more efficent way of doing this?
You could get better efficiency if you do the sort as part of the query that initializes MyDataTable
, instead of sorting in memory afterwards.
Upvotes: 1