niklasstoffers

Reputation: 33

Why does it take more time when you run a LINQ OrderBy before Select?

While writing a solution for a coding problem I discovered an interesting behavior of my LINQ statements. I had two scenarios:

First:

arr.Select(x => x + 5).OrderBy(x => x)

Second:

arr.OrderBy(x => x).Select(x => x + 5)

After a little bit of testing with System.Diagnostics.Stopwatch I got the following results for an integer array of length 100_000.

For the first approach:

00:00:00.0000152

For the second:

00:00:00.0073650

Now I'm interested in why it takes more time when the ordering is done first. I wasn't able to find anything on Google, so I thought about it myself.

I ended up with two ideas:
1. The second scenario has to convert to IOrderedEnumerable and then back to IEnumerable, while the first scenario only has to convert to IOrderedEnumerable and not back.
2. You end up with two loops: the first for the sorting and the second for the selecting, while approach 1 does everything in one loop.

So my question is: why does it take so much more time when the ordering is done before the select?
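For what it's worth, both statements only build a query; LINQ execution is deferred until the sequence is enumerated, so where the Stopwatch starts and stops matters a lot. A minimal sketch (the ToArray() call is what actually triggers the sort):

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class DeferredDemo
{
    static void Main()
    {
        int[] arr = Enumerable.Range(0, 100_000).Reverse().ToArray();

        var sw = Stopwatch.StartNew();
        // This line only composes the query; nothing is sorted yet.
        var query = arr.Select(x => x + 5).OrderBy(x => x);
        sw.Stop();
        Console.WriteLine($"Build query: {sw.Elapsed}");

        sw.Restart();
        // Enumeration (here via ToArray) runs the projection and the sort.
        int[] result = query.ToArray();
        sw.Stop();
        Console.WriteLine($"Enumerate:   {sw.Elapsed}");

        Console.WriteLine(result[0]); // smallest value after +5, i.e. 5
    }
}
```

If the Stopwatch only covers the line that builds the query, the measured time reflects allocation of a few iterator objects, not the sort itself.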

Upvotes: 3

Views: 921

Answers (2)

Dmitrii Bychenko

Reputation: 186678

Let's have a look at the sequences:

private static void UnderTestOrderBySelect(int[] arr) {
  var query = arr.OrderBy(x => x).Select(x => x + 5); 

  foreach (var item in query)
    ; // force full enumeration of the query
}

private static void UnderTestSelectOrderBy(int[] arr) {
  var query = arr.Select(x => x + 5).OrderBy(x => x);  

  foreach (var item in query)
    ; // force full enumeration of the query
}

// See Marc Gravell's comment; let's compare Linq and inplace Array.Sort
private static void UnderTestInPlaceSort(int[] arr) {
  var tmp = arr;
  var x = new int[tmp.Length];

  for (int i = 0; i < tmp.Length; i++)
    x[i] = tmp[i] + 5;

  Array.Sort(x);
}

In order to benchmark, let's run each method 10 times and average the 6 middle results:

private static string Benchmark(Action<int[]> methodUnderTest) {
  List<long> results = new List<long>();

  int n = 10;

  for (int i = 0; i < n; ++i) {
    Random random = new Random(1);

    int[] arr = Enumerable
      .Range(0, 10000000)
      .Select(x => random.Next(1000000000))
      .ToArray();

    Stopwatch sw = new Stopwatch();

    sw.Start();

    methodUnderTest(arr);

    sw.Stop();

    results.Add(sw.ElapsedMilliseconds);
  }

  var valid = results
    .OrderBy(x => x)
    .Skip(2)                  // drop the 2 fastest runs
    .Take(results.Count - 4)  // drop the 2 slowest runs
    .ToArray();

  return $"{string.Join(", ", valid)} average : {(long) (valid.Average() + 0.5)}";
}

Time to run and have a look at the results:

  string report = string.Join(Environment.NewLine,
    $"OrderBy + Select: {Benchmark(UnderTestOrderBySelect)}",
    $"Select + OrderBy: {Benchmark(UnderTestSelectOrderBy)}",
    $"Inplace Sort:     {Benchmark(UnderTestInPlaceSort)}");

  Console.WriteLine(report);

Outcome: (Core i7 3.8GHz, .Net 4.8 IA64)

OrderBy + Select: 4869, 4870, 4872, 4874, 4878, 4895 average : 4876
Select + OrderBy: 4763, 4763, 4793, 4802, 4827, 4849 average : 4800
Inplace Sort:     888, 889, 890, 893, 896, 904 average : 893

I don't see any significant difference: Select + OrderBy seems to be slightly more efficient (about a 2% gain) than OrderBy + Select. The in-place sort, however, performs far better (about 5 times faster) than either Linq version.
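If readability of the LINQ projection is wanted but the sort cost dominates, one option (a sketch, not part of the answer above) is to materialize the projection with ToArray and then sort in place, which avoids OrderBy's key extraction and internal buffering:

```csharp
using System;
using System.Linq;

class MaterializeThenSort
{
    static void Main()
    {
        int[] arr = { 7, 2, 9, 4 };

        // Project with LINQ, then sort the materialized array in place.
        int[] result = arr.Select(x => x + 5).ToArray();
        Array.Sort(result);

        Console.WriteLine(string.Join(", ", result)); // 7, 9, 12, 14
    }
}
```

Note that Array.Sort is an unstable sort, whereas OrderBy is stable; for plain integers that distinction doesn't matter.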

Upvotes: 3

MakePeaceGreatAgain

Reputation: 37000

Depending on which Linq provider you have, some optimization may happen on the query. E.g. if you used some kind of database, chances are high your provider would create the exact same query for both statements, similar to this one:

select myColumn from myTable order by myColumn;

Thus performance should be identical, no matter whether you order first or select first in Linq.

As this does not seem to happen here, you are probably using Linq2Objects, which has no such optimization at all. So the order of your statements may have an effect, in particular if you had some kind of filter using Where that filtered many objects out, so that later statements wouldn't operate on the entire collection.
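The point about Where is easy to demonstrate. In Linq2Objects the operators run exactly in the order written, so filtering before sorting shrinks the input to the O(n log n) step; a small sketch:

```csharp
using System;
using System.Linq;

class OperatorOrderDemo
{
    static void Main()
    {
        int[] data = Enumerable.Range(0, 1_000_000).Reverse().ToArray();

        // Filter first: only the ~10 matching elements reach the sort.
        int[] filteredFirst = data.Where(x => x < 10).OrderBy(x => x).ToArray();

        // Sort first: all 1,000,000 elements are sorted, then filtered.
        int[] sortedFirst = data.OrderBy(x => x).Where(x => x < 10).ToArray();

        // Both produce the same result; only the amount of work differs.
        Console.WriteLine(filteredFirst.SequenceEqual(sortedFirst)); // True
    }
}
```

With a database provider, by contrast, the query translator would typically push the filter into the generated SQL regardless of where it appears in the chain.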

To make a long story short: the difference most probably comes from some internal initialization logic. As a dataset of 100,000 numbers is not really big - at least not big enough - even some fast initialization has a big impact.

Upvotes: 2
