Krzysztof Skowronek

Reputation: 2936

OrderBy and Take in LINQ

I was watching some videos of programmer interviews on YouTube. One of the questions was to write a function that returns the n-th smallest element of an array.

In the video I watched, a lady tried to code that in C++ with some recursion, but I thought that, well, in C# it's a one-liner:

var nth = vals.OrderBy(x => x).Take(i).Last();

Then I realised that I had no idea whether this is in fact a good solution, since the next question was about time complexity. I went to the docs, and all I found is that the object returned by OrderBy has all the information needed to perform a full deferred sort when it is enumerated.

So I decided to test it and wrote a class MyComparable : IComparable&lt;MyComparable&gt; with a single value and a static counter in its CompareTo method:

    class MyComparable : IComparable<MyComparable>
    {
        public MyComparable(int val)
        {
            Val = val;
        }

        public static int CompareCount { get; set; }

        public int Val { get; set; }

        public int CompareTo(MyComparable other)
        {
            ++CompareCount;

            if (ReferenceEquals(this, other)) return 0;
            if (ReferenceEquals(null, other)) return 1;

            return Val.CompareTo(other.Val);
        }
    }

Then I wrote a loop that finds the n-th element in an array:

    static void Main(string[] args)
    {
        var rand = new Random();

        var vals = Enumerable.Range(0, 10000)
            // .Reverse() // pessimistic scenario
            // .OrderBy(x => rand.Next()) // average scenario
            .Select(x => new MyComparable(x))
            .ToArray();

        for (int i = 1; i < 100; i++)
        {
            var nth = vals.OrderBy(x => x).Take(i).Last();
            Console.WriteLine($"i: {i,5}, OrderBy: {MyComparable.CompareCount,10}, value {nth.Val}");
            MyComparable.CompareCount = 0;

            var my_nth = vals.OrderByInsertion().Take(i).Last();
            Console.WriteLine($"i: {i,5}, Insert : {MyComparable.CompareCount,10}, value {my_nth.Val}");
            MyComparable.CompareCount = 0;

            my_nth = vals.OrderByInsertionWithIndex().Take(i).Last();
            Console.WriteLine($"i: {i,5}, Index  : {MyComparable.CompareCount,10}, value {my_nth.Val}");
            MyComparable.CompareCount = 0;

            Console.WriteLine();
            Console.WriteLine();
        }
    }

I also wrote two "different" implementations that find the minimum element, return it, and remove it from a list:

    public static IEnumerable<T> OrderByInsertion<T>(this IEnumerable<T> input) where T : IComparable<T>
    {
        var list = input.ToList();

        while (list.Any())
        {
            var min = list.Min();
            yield return min;
            list.Remove(min);
        }
    }

    public static IEnumerable<T> OrderByInsertionWithIndex<T>(this IEnumerable<T> input) where T : IComparable<T>
    {
        var list = input.ToList();

        while (list.Any())
        {
            var minIndex = 0;

            for (int i = 1; i < list.Count; i++)
            {
                if (list[i].CompareTo(list[minIndex]) < 0)
                {
                    minIndex = i;
                }
            }

            yield return list[minIndex];
            list.RemoveAt(minIndex);
        }
    }

The results really were a surprise to me:

i:     1, OrderBy:      19969, value 0
i:     1, Insert :       9999, value 0
i:     1, Index  :       9999, value 0


i:     2, OrderBy:      19969, value 1
i:     2, Insert :      19997, value 1
i:     2, Index  :      19997, value 1


i:     3, OrderBy:      19969, value 2
i:     3, Insert :      29994, value 2
i:     3, Index  :      29994, value 2


i:     4, OrderBy:      19969, value 3
i:     4, Insert :      39990, value 3
i:     4, Index  :      39990, value 3


i:     5, OrderBy:      19970, value 4
i:     5, Insert :      49985, value 4
i:     5, Index  :      49985, value 4

...

i:    71, OrderBy:      19973, value 70
i:    71, Insert :     707444, value 70
i:    71, Index  :     707444, value 70

...

i:    99, OrderBy:      19972, value 98
i:    99, Insert :     985050, value 98
i:    99, Index  :     985050, value 98

Just using LINQ's OrderBy().Take(n) is by far the most efficient and fastest, which is what I anticipated, but I would never have guessed the gap would be several orders of magnitude.

So, my question is mostly to interviewers: how would you grade such an answer?

Code:

var nth = vals.OrderBy(x => x).Take(i).Last();

Time complexity:

I don't know the details, but OrderBy probably uses some kind of quicksort, so O(n log n), no matter which n-th element we want.

Would you ask me to implement my own methods like those or would it be enough to use the framework?

EDIT:

So, it turns out that, as the answer below suggests, OrderedEnumerable uses a variation of QuickSelect to sort only the elements up to the one you ask for. On the plus side, it caches the order.

While you are able to find the n-th element a little faster, it's not a class faster, just some percentage faster. Also, each and every C# programmer will understand your one-liner.

I think that my answer during the interview would end up somewhere along the lines of: "I will use OrderBy, because it is fast enough and writing it takes 10 seconds. If it turns out to be too slow, we can gain some time with QuickSelect, but implementing it nicely takes a lot of time."

Thanks to everyone who decided to participate in this, and sorry to everyone who wasted their time thinking this was something else :)

Upvotes: 1

Views: 743

Answers (1)

Vogel612

Reputation: 5647

Okay, let's start with the low-hanging fruit:

Your implementation is wrong. You need to take index + 1 elements from the sequence. To understand this, consider index = 0 and reread the documentation for Take.

Your "benchmark comparison" only works because calling OrderBy() on an IEnumerable does not modify the underlying collection. For what we're going to do it's easier to just allow modifications to the underlying array. As such I've taken the liberty to change your code to generate the values from scratch in every iteration.

Additionally, Take(i + 1).Last() is equivalent to ElementAt(i). You should really be using that.
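To illustrate the equivalence (a minimal sketch with made-up values, not code from the question):

    var vals = new[] { 5, 1, 4, 2, 3 };
    int i = 2;
    // Both return the element at zero-based position i of the sorted sequence:
    var a = vals.OrderBy(x => x).Take(i + 1).Last(); // 3rd smallest -> 3
    var b = vals.OrderBy(x => x).ElementAt(i);       // same result, clearer intent

ElementAt states the intent directly and avoids the off-by-one trap that Take(i) introduces.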

Oh, and your benchmark is really not useful, because the more elements of your range you consume with Take, the closer these algorithms should come to each other. As far as I can tell, you're correct with your runtime analysis of O(n log n).

There is a solution with a time complexity of O(n), though (not O(log n) as I incorrectly claimed earlier). This is the solution the interviewer was expecting.
For what it's worth, the code you've written there is not adaptable to that solution, because you don't have any control over the sort process.

If you had, you could implement Quickselect (which is the goal here), resulting in a theoretical improvement over the LINQ query you propose (especially for high indices). The following is a translation of the pseudocode from the Wikipedia article on quickselect, based on your code:

static T SelectK<T>(T[] values, int left, int right, int index) 
   where T : IComparable<T>
{
    if (left == right) { return values[left]; }
    // could select pivot deterministically through median of 3 or something
    var pivotIndex = rand.Next(left, right + 1);
    pivotIndex = Partition(values, left, right, pivotIndex);
    if (index == pivotIndex) {
        return values[index];
    } else if (index < pivotIndex) {
        return SelectK(values, left, pivotIndex - 1, index);
    } else {
        return SelectK(values, pivotIndex + 1, right, index);
    }
}

static int Partition<T>(T[] values, int left, int right, int pivot) 
    where T : IComparable<T>
{
    var pivotValue = values[pivot];
    Swap(values, pivot, right);
    var storeIndex = left;
    for (var i = left; i < right; i++) {
        if (values[i].CompareTo(pivotValue) < 0)
        {
            Swap(values, storeIndex, i);
            storeIndex++;
        }
    }
    Swap(values, right, storeIndex);
    return storeIndex;
}
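The snippet above relies on a Swap helper and a rand field that aren't shown; a minimal version (my own filler, assumed by SelectK and Partition above) could look like this:

    // Shared RNG for random pivot selection (assumed by SelectK above).
    static readonly Random rand = new Random();

    static void Swap<T>(T[] values, int i, int j)
    {
        var tmp = values[i];
        values[i] = values[j];
        values[j] = tmp;
    }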

A non-representative subsample of a test I've run gives the output:

i:  6724, OrderBy:      52365, value 6723
i:  6724, SelectK:      40014, value 6724


i:   395, OrderBy:      14436, value 394
i:   395, SelectK:      26106, value 395


i:  7933, OrderBy:      32523, value 7932
i:  7933, SelectK:      17712, value 7933


i:  6730, OrderBy:      46076, value 6729
i:  6730, SelectK:      34367, value 6730


i:  6536, OrderBy:      53458, value 6535
i:  6536, SelectK:      18341, value 6536

Since my SelectK implementation uses a random pivot element, there is quite some variation in its output (see, for example, the second run). It's also considerably worse than the highly optimized sorting algorithm implemented in the standard library.
Even then, there are cases where SelectK straight up outperforms the standard library, even though I didn't put much effort in.

Now, replacing the random pivot with a median of 3[1] (which is a pretty bad pivot selector), we can obtain a slightly different SelectK and race that against OrderBy and the original SelectK.
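The linked implementation isn't reproduced here; a hedged sketch of what a median-of-3 pivot selector might look like (my own version, not necessarily the one benchmarked, reusing the Swap helper from the Partition code) is:

    // Returns the index of the median of values[left], values[mid], values[right],
    // rearranging those three so the median ends up at mid.
    // Costs up to 3 comparisons per call, as noted in [1].
    static int MedianOf3<T>(T[] values, int left, int right) where T : IComparable<T>
    {
        int mid = left + (right - left) / 2;
        if (values[left].CompareTo(values[mid]) > 0) Swap(values, left, mid);
        if (values[left].CompareTo(values[right]) > 0) Swap(values, left, right);
        if (values[mid].CompareTo(values[right]) > 0) Swap(values, mid, right);
        return mid;
    }

Passing the returned index as pivotIndex into Partition is all that changes relative to the random-pivot variant.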

I've been racing these three horses with 1m elements in the array, using the random sort you already had, requesting an index in the last 20% of the array and got results like the following:

Winning counts: OrderBy 32, SelectK 32, MedianOf3 35
Winning counts: OrderBy 26, SelectK 35, MedianOf3 38 
Winning counts: OrderBy 25, SelectK 41, MedianOf3 33

Even for 100k elements and without restricting the index to the end of the array this pattern seems to persist, though not quite as pronounced:

--- 100k elements
Winning counts: OrderBy 24, SelectK 34, MedianOf3 41
Winning counts: OrderBy 33, SelectK 33, MedianOf3 33
Winning counts: OrderBy 32, SelectK 38, MedianOf3 29
--- 1m elements
Winning counts: OrderBy 27, SelectK 32, MedianOf3 40
Winning counts: OrderBy 32, SelectK 38, MedianOf3 29
Winning counts: OrderBy 35, SelectK 31, MedianOf3 33

Generally speaking, a sloppily implemented quickselect outperforms your suggestion in the average case two thirds of the time... I'd say that's a pretty strong indicator that it's the better algorithm to use if you want to get into the nitty-gritty details.

Of course your implementation is significantly easier to understand :)

[1] - Implementation taken from this SO answer, incurring 3 comparisons per recursion-depth step

Upvotes: 2
