Reputation: 215
I've an array data
which has shape (N,6)
. I am sorting this array by the final column such that:
sortx = numpy.argsort( data[:,-1] )[::-1]
sortedData = data[ sortx, : ]
where the [::-1]
is used to sort the column from high to low, instead of low to high, and the data is float64. I then save this sorted array to a .npy
file as follows:
numpy.save( 'file.npy', sortedData )
However, when I load the array back in again and check the sorting of the data, it doesn't appear to be in order! It's only some of the rows, not all of them which is weird.
data_again = numpy.load( 'file.npy' )
order = numpy.argsort( data_again[:,-1] )[::-1]
r = numpy.arange( len(data_again) )
If you compare r
and order
with numpy.sum( order == r)
, you see that this does not equal N. Around 2% are not in the same order!
Firstly, am I understanding the above code correctly? Secondly, can anyone reproduce this? I'm using Python 2.7.2, numpy 1.6.1 on Linux.
Update: This behaviour occurs even directly after the first sorting, and before the save. So it's to do with the sorting itself. There are duplicate values in the sorted column.
Upvotes: 2
Views: 1065
Reputation: 879331
I can reproduce the symptoms if the final column has repeated values:
import numpy as np
np.random.seed(0)
data = np.random.random((8,2))
data[::2,-1] = data[1::2,-1]
print(data)
# [[ 0.5488135 0.54488318]
# [ 0.60276338 0.54488318]
# [ 0.4236548 0.891773 ]
# [ 0.43758721 0.891773 ]
# [ 0.96366276 0.52889492]
# [ 0.79172504 0.52889492]
# [ 0.56804456 0.0871293 ]
# [ 0.07103606 0.0871293 ]]
sortx = np.argsort( data[:,-1] )[::-1]
sorted_data = data[ sortx, : ]
order = np.argsort( sorted_data[:,-1] )[::-1]
r = np.arange( len(sorted_data) )
print(order)
# [1 0 3 2 5 4 7 6]
print(r)
# [0 1 2 3 4 5 6 7]
print(np.allclose(order, r))
# False
np.argsort uses quicksort by default. Quicksort is not stable, so the order of tied rows is not necessarily in the same order as the original data.
However, even if you were to use a stable sort such as mergesort
, when you reverse the result of np.argsort
, for rows that are tied, the higher index comes first.
Thus, when you call np.argsort
a second time, you do not get order
equal to r
.
You could produce that order by sorting the final column and using np.arange(len(data),0,-1)
as the tie-breaker:
sortx = np.lexsort((np.arange(len(data),0,-1), data[:,-1]))[::-1]
sorted_data = data[ sortx, : ]
order = np.lexsort((np.arange(len(data),0,-1), sorted_data[:,-1]))[::-1]
r = np.arange( len(sorted_data) )
print(order)
# [0 1 2 3 4 5 6 7]
print(r)
# [0 1 2 3 4 5 6 7]
print(np.allclose(order, r))
# True
Using np.arange(len(data),0,-1)
places the higher index first (for tied rows), so that when you reveres the indices, the lower index is first.
Upvotes: 4