Reputation: 1471

How to sort a NumPy array of strings by the last column

Is there a way to sort the rows of an array by the last element, in this case the cell IDs? The cell ID is built as follows: "CellID_NumberOfCell".

arr =np.array([['65.0','30.0','20.0','0.0','0_0'],
 ['2.0','29.0','24.0','0.0','1_0'],
 ['0.0','18.0','4.0','0.0','2_0'],
 ['16.0','9.0','0.0','9990.0','7_203'],
 ['16.0','9.0','0.0','9990.0','0_203'],
 ['20.0','23.0','31.0','9990.0','8_158'],
 ['65.0','30.0','20.0','0.0','0_10']])

So after sorting it should look like:

arr =np.array([['65.0','30.0','20.0','0.0','0_0'],
 ['65.0','30.0','20.0','0.0','0_10'],
 ['16.0','9.0','0.0','9990.0','0_203'],
 ['2.0','29.0','24.0','0.0','1_0'],
 ['0.0','18.0','4.0','0.0','2_0'],
 ['16.0','9.0','0.0','9990.0','7_203'],
 ['20.0','23.0','31.0','9990.0','8_158']])

EDIT:

Is it also possible to delete the numbers after the underscore after sorting, so that I just have the ID? For example, 0 instead of 0_0.

EDIT 2:

After sorting by ID, the rows should also be sorted by time, so that all rows with ID 0, for example, are ordered by time 0, 1, ..., 9999, and so on.

Upvotes: 1

Answers (3)

Divakar

Reputation: 221664

We need to split the last column by that underscore, lexsort it and then use those indices to sort the input array.

Thus, an implementation would be -

import numpy as np

def numpy_app(arr):
    # Split the strings of the last column on '_'. For the given sample,
    # this splits the last column further into 3 columns, the middle
    # one holding the '_' separators.
    # np.char is the public alias of the older np.core.defchararray path.
    a = np.char.partition(arr[:,-1],'_')

    # Lexsort it on the last numeric cols (0,2). We need to flip
    # the order of columns to give precedence to the first string
    sidx = np.lexsort(a[:,2::-2].astype(int).T)

    # Index into input array with lex-sorted indices for final o/p
    return arr[sidx]

Based on the edits in the question, it seems we want to cut out the string after the underscore. To do so, here's a modified version -

def numpy_cut_app(arr):
    a = np.char.partition(arr[:,-1],'_')
    sidx = np.lexsort(a[:,2::-2].astype(int).T)
    out = arr[sidx]

    # Replace the last column with the first string off the last column's split one
    out[:,-1] = a[sidx,0]
    return out

Based on more edits, it seems we want to include the fourth column into lex-sorting and neglect everything after the underscore in the last column. So, a further modified version would be -

def numpy_cut_col3_app(arr):
    a = np.char.partition(arr[:,-1],'_')

    # Lex-sort using the first of the split strings from the last col
    # (precedence to it, cast to int so that IDs compare numerically,
    # e.g. 2 before 10) and col-3 of the input array
    sidx = np.lexsort([arr[:,3].astype(float), a[:,0].astype(int)])
    out = arr[sidx]
    out[:,-1] = a[sidx,0]
    return out

Sample runs -

In [567]: arr
Out[567]: 
array([['65.0', '30.0', '20.0', '0.0', '9_49'],
       ['2.0', '29.0', '24.0', '0.0', '1_0'],
       ['0.0', '18.0', '4.0', '0.0', '2_0'],
       ['16.0', '9.0', '0.0', '9990.0', '7_203'],
       ['16.0', '9.0', '0.0', '9990.0', '9_5'],
       ['20.0', '23.0', '31.0', '9990.0', '8_158'],
       ['65.0', '30.0', '20.0', '0.0', '9_50']], 
      dtype='|S6')

In [568]: numpy_app(arr)
Out[568]: 
array([['2.0', '29.0', '24.0', '0.0', '1_0'],
       ['0.0', '18.0', '4.0', '0.0', '2_0'],
       ['16.0', '9.0', '0.0', '9990.0', '7_203'],
       ['20.0', '23.0', '31.0', '9990.0', '8_158'],
       ['16.0', '9.0', '0.0', '9990.0', '9_5'],
       ['65.0', '30.0', '20.0', '0.0', '9_49'],
       ['65.0', '30.0', '20.0', '0.0', '9_50']], 
      dtype='|S6')

In [569]: numpy_cut_app(arr)
Out[569]: 
array([['2.0', '29.0', '24.0', '0.0', '1'],
       ['0.0', '18.0', '4.0', '0.0', '2'],
       ['16.0', '9.0', '0.0', '9990.0', '7'],
       ['20.0', '23.0', '31.0', '9990.0', '8'],
       ['16.0', '9.0', '0.0', '9990.0', '9'],
       ['65.0', '30.0', '20.0', '0.0', '9'],
       ['65.0', '30.0', '20.0', '0.0', '9']], 
      dtype='|S6')
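
For reference, here is a minimal, self-contained sketch of the first variant on the question's own sample, written with explicitly named keys. The names cell_id, cell_num and order are illustrative only and do not appear in the functions above. np.lexsort treats the last key in the sequence as the primary one, which is why cell_id is passed last.

```python
import numpy as np

arr = np.array([['65.0', '30.0', '20.0', '0.0', '0_0'],
                ['2.0', '29.0', '24.0', '0.0', '1_0'],
                ['0.0', '18.0', '4.0', '0.0', '2_0'],
                ['16.0', '9.0', '0.0', '9990.0', '7_203'],
                ['16.0', '9.0', '0.0', '9990.0', '0_203'],
                ['20.0', '23.0', '31.0', '9990.0', '8_158'],
                ['65.0', '30.0', '20.0', '0.0', '0_10']])

# Split "CellID_NumberOfCell" into (before, '_', after) columns.
parts = np.char.partition(arr[:, -1], '_')
cell_id = parts[:, 0].astype(int)    # part before the underscore
cell_num = parts[:, 2].astype(int)   # part after the underscore

# Last key is the primary key: sort by cell_id, break ties with cell_num.
order = np.lexsort((cell_num, cell_id))
sorted_arr = arr[order]

print(sorted_arr[:, -1])
# ['0_0' '0_10' '0_203' '1_0' '2_0' '7_203' '8_158']
```

Casting both parts to int is what makes 0_10 sort before 0_203 numerically rather than by character.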

Upvotes: 2

Tbaki

Reputation: 1003

You can do it easily with sorted and a lambda function and, as suggested by @Divakar, wrap the result in np.array to get a NumPy array back:

np.array(sorted(arr, key=lambda x :x[-1]))

output

[['65.0', '30.0', '20.0', '0.0', '0_0'],
['65.0', '30.0', '20.0', '0.0', '0_10'],
['16.0', '9.0', '0.0', '9990.0', '0_203'],
['2.0', '29.0', '24.0', '0.0', '1_0'],
['0.0', '18.0', '4.0', '0.0', '2_0'],
['16.0', '9.0', '0.0', '9990.0', '7_203'],
['20.0', '23.0', '31.0', '9990.0', '8_158']]

EDIT: you can do it with this; it is not pretty, but it does the job:

np.array([ np.append(i[:-1],i[-1].split("_")[0]) for i in sorted(list(arr), key=lambda x :x[-1])])

output

array([['65.0', '30.0', '20.0', '0.0', '0'],
       ['65.0', '30.0', '20.0', '0.0', '0'],
       ['16.0', '9.0', '0.0', '9990.0', '0'],
       ['2.0', '29.0', '24.0', '0.0', '1'],
       ['0.0', '18.0', '4.0', '0.0', '2'],
       ['16.0', '9.0', '0.0', '9990.0', '7'],
       ['20.0', '23.0', '31.0', '9990.0', '8']], 
      dtype='<U6')
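
Note that the key x[-1] compares the IDs as plain strings. That matches the expected output for the sample in the question, but it would place 0_10 before 0_2. As a hedged sketch, assuming every ID has the form "int_int", a numeric key could look like this (numeric_cell_key and the small sample array are made up for illustration):

```python
import numpy as np

def numeric_cell_key(row):
    """Return (cell id, cell number) as integers for one row."""
    cell_id, cell_num = row[-1].split('_')
    return int(cell_id), int(cell_num)

# Hypothetical sample where string order and numeric order differ.
sample = np.array([['1.0', '10_0'],
                   ['2.0', '0_10'],
                   ['3.0', '2_0'],
                   ['4.0', '0_2']])

by_string = np.array(sorted(sample, key=lambda x: x[-1]))
by_number = np.array(sorted(sample, key=numeric_cell_key))

print(by_string[:, -1])  # ['0_10' '0_2' '10_0' '2_0']
print(by_number[:, -1])  # ['0_2' '0_10' '2_0' '10_0']
```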

Upvotes: 2

P. Camilleri

Reputation: 13218

np.argsort(arr[:, -1]) will give you the permutation so that elements of the last column of arr are ordered.

Then, arr[np.argsort(arr[:, -1])] reorders the rows of arr according to this permutation.
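
A minimal sketch of those two steps, on a small made-up array (the names demo, perm and reordered are only for illustration):

```python
import numpy as np

demo = np.array([['a', '0_2'],
                 ['b', '0_10'],
                 ['c', '1_0']])

# Permutation that puts the last column in sorted (string) order.
perm = np.argsort(demo[:, -1])
print(perm)  # [1 0 2]

# Apply the permutation to whole rows.
reordered = demo[perm]
print(reordered[:, -1])  # ['0_10' '0_2' '1_0']
```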

Beware that lexicographic order is used, since your data consists of strings, so 0_10 comes before 0_2. If this is not what you want, you should split the last column, and I advise you to use a pandas.DataFrame:

import pandas as pd
df = pd.DataFrame(arr)
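# Split "CellID_NumberOfCell" on the first underscore into two new columns
df[['Cell', 'CellIndex']] = df[df.columns[-1]].str.split('_', n=1, expand=True)
df['Cell'] = df['Cell'].astype(int)
df['CellIndex'] = df['CellIndex'].astype(int)
# sort_values returns a new DataFrame, so keep the result
df = df.sort_values(['Cell', 'CellIndex'])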

pandas is really the way to go to manipulate this kind of data.
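
For the second edit in the question (sort by ID, then by time), a similar pandas sketch might look as follows. It assumes that the column at index 3 is the time column, which the question does not state explicitly; the names Cell, Time, id_col and result are illustrative.

```python
import numpy as np
import pandas as pd

arr = np.array([['65.0', '30.0', '20.0', '0.0', '0_0'],
                ['2.0', '29.0', '24.0', '0.0', '1_0'],
                ['0.0', '18.0', '4.0', '0.0', '2_0'],
                ['16.0', '9.0', '0.0', '9990.0', '7_203'],
                ['16.0', '9.0', '0.0', '9990.0', '0_203'],
                ['20.0', '23.0', '31.0', '9990.0', '8_158'],
                ['65.0', '30.0', '20.0', '0.0', '0_10']])

df = pd.DataFrame(arr)
id_col = df.columns[-1]  # last column holds "CellID_NumberOfCell"

# Keep only the part before the underscore and make it numeric.
df['Cell'] = df[id_col].str.split('_', n=1).str[0].astype(int)
# Assumption: column 3 is the time column.
df['Time'] = df[3].astype(float)

# Sort by cell ID first, then by time within each ID.
result = df.sort_values(['Cell', 'Time'])
print(result[['Cell', 'Time']])
```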

Upvotes: 5
