Reputation: 2062
I was wondering whether pandas sorting with sort_values()
is a deterministic operation in case of ties, i.e. if calling df.sort_values('foo')
would always return the same sorting, no matter how often I run it?
One example would be
df=pd.DataFrame(np.random.randint(1, 3, 5),columns=["foo"])
df.sort_values(['foo'])
foo
0 1
4 1
1 2
2 2
3 2
I understand that the operation is not stable, but is it deterministic?
Upvotes: 1
Views: 1246
Reputation: 25354
Yes. If you use kind='quicksort'
, the output is deterministic, but not stable.
The reason why quicksort can be nondeterministic is that all quicksort implementations are made up of three steps:
There are three popular ways of implementing step 1.
The first way is deterministic. The second and third ways are nondeterministic.
So, which kind of quicksort does Pandas implement? Pandas dispatches sort_values() to sort_index(), which uses numpy's argsort() to do the sort. How does numpy implement picking the pivot? That's defined in this file.
The pivot element is vp
. It is chosen like so:
/* quicksort partition */
pm = pl + ((pr - pl) >> 1);
[...]
vp = *pm;
How does this work? The variables pr
and pl
are pointers to the beginning and end of the region to be sorted, respectively. If you subtract the two, that is the number of elements to be sorted. If you shift that right once, that's dividing it by 2. So the pm
pointer points to an element halfway into the array. Then pm
is de-referenced to get the pivot element. (Note that this isn't necessarily the median element of the array! It could be the smallest element, or the largest.)
This means that numpy uses the first way to pick elements - it is arbitrary but deterministic. The tradeoff for this is that for some orderings of data, the sort performance will degrade from O(N log N) to O(N^2).
More information about implementing quicksort
Upvotes: 3