Carl M
Carl M

Reputation: 215

numpy 1.6.1 argsort() strange behaviour?

I've an array data which has shape (N,6). I am sorting this array by the final column such that:

sortx = numpy.argsort( data[:,-1] )[::-1]
sortedData = data[ sortx, : ]

where the [::-1] is used to sort the column from high to low, instead of low to high, and the data is float64. I then save this sorted array to a .npy file as follows:

numpy.save( 'file.npy', sortedData )

However, when I load the array back in again and check the sorting of the data, it doesn't appear to be in order! It's only some of the rows, not all of them which is weird.

data_again = numpy.load( 'file.npy' )
order = numpy.argsort( data_again[:,-1] )[::-1]
r = numpy.arange( len(data_again) )

If you compare r and order with numpy.sum( order == r), you see that this does not equal N. Around 2% are not in the same order!

Firstly, am I understanding the above code correctly? Secondly, can anyone reproduce this? I'm using Python 2.7.2, numpy 1.6.1 on Linux.

Update: This behaviour occurs even directly after the first sorting, and before the save. So it's to do with the sorting itself. There are duplicate values in the sorted column.

Upvotes: 2

Views: 1065

Answers (1)

unutbu
unutbu

Reputation: 879331

I can reproduce the symptoms if the final column has repeated values:

import numpy as np
np.random.seed(0)
data = np.random.random((8,2))
data[::2,-1] = data[1::2,-1]
print(data)
# [[ 0.5488135   0.54488318]
#  [ 0.60276338  0.54488318]
#  [ 0.4236548   0.891773  ]
#  [ 0.43758721  0.891773  ]
#  [ 0.96366276  0.52889492]
#  [ 0.79172504  0.52889492]
#  [ 0.56804456  0.0871293 ]
#  [ 0.07103606  0.0871293 ]]
sortx = np.argsort( data[:,-1] )[::-1]
sorted_data = data[ sortx, : ]

order = np.argsort( sorted_data[:,-1] )[::-1]
r = np.arange( len(sorted_data) )
print(order)
# [1 0 3 2 5 4 7 6]
print(r)
# [0 1 2 3 4 5 6 7]
print(np.allclose(order, r))
# False

np.argsort uses quicksort by default. Quicksort is not stable, so the order of tied rows is not necessarily in the same order as the original data.

However, even if you were to use a stable sort such as mergesort, when you reverse the result of np.argsort, for rows that are tied, the higher index comes first.

Thus, when you call np.argsort a second time, you do not get order equal to r.

You could produce that order by sorting the final column and using np.arange(len(data),0,-1) as the tie-breaker:

sortx = np.lexsort((np.arange(len(data),0,-1), data[:,-1]))[::-1]
sorted_data = data[ sortx, : ]

order = np.lexsort((np.arange(len(data),0,-1), sorted_data[:,-1]))[::-1]
r = np.arange( len(sorted_data) )
print(order)
# [0 1 2 3 4 5 6 7]
print(r)
# [0 1 2 3 4 5 6 7]
print(np.allclose(order, r))
# True

Using np.arange(len(data),0,-1) places the higher index first (for tied rows), so that when you reveres the indices, the lower index is first.

Upvotes: 4

Related Questions