Is pandas sort_values() deterministic in case of ties?

Question

I was wondering whether pandas sorting with sort_values() is a deterministic operation in case of ties, i.e. if calling df.sort_values('foo') would always return the same sorting, no matter how often I run it? One example would be

df=pd.DataFrame(np.random.randint(1, 3, 5),columns=["foo"])
df.sort_values(['foo'])

    foo
0   1
4   1
1   2
2   2
3   2

I understand that the operation is not stable, but is it deterministic?

Nick ODell · Accepted Answer

Yes. If you use kind='quicksort', the output is deterministic, but not stable.

The reason why quicksort can be nondeterministic is that all quicksort implementations are made up of three steps:

Pick a pivot element.
Divide the list into two lists: the elements smaller than the pivot, and the elements larger than the pivot.
Run quicksort on both halves of the list.

There are three popular ways of implementing step 1.

The first way is to arbitrarily pick a pivot element, such as picking the first element, or middle element.
The second way is to pick an element at random.
The third way is to pick several elements at random, and compute a median (or median of medians.)

The first way is deterministic. The second and third ways are nondeterministic.

So, which kind of quicksort does Pandas implement? Pandas dispatches sort_values() to sort_index(), which uses numpy's argsort() to do the sort. How does numpy implement picking the pivot? That's defined in this file.

The pivot element is vp. It is chosen like so:

/* quicksort partition */
pm = pl + ((pr - pl) >> 1);
[...]
vp = *pm;

How does this work? The variables pr and pl are pointers to the beginning and end of the region to be sorted, respectively. If you subtract the two, that is the number of elements to be sorted. If you shift that right once, that's dividing it by 2. So the pm pointer points to an element halfway into the array. Then pm is de-referenced to get the pivot element. (Note that this isn't necessarily the median element of the array! It could be the smallest element, or the largest.)

This means that numpy uses the first way to pick elements - it is arbitrary but deterministic. The tradeoff for this is that for some orderings of data, the sort performance will degrade from O(N log N) to O(N^2).

More information about implementing quicksort

Is pandas sort_values() deterministic in case of ties?

Answers (1)

Related Questions