anita

Reputation: 177

How to sort two-dimensional array by first column as numeric when list of strings? (Python)

I have a two-dimensional array that I'm trying to sort by the first column. However, currently every element in the array is a string, and I'd like the first column to be treated as an integer so I can sort numerically (1, 2, 6, 11... instead of 1, 11, 224, 23,...). I'm using the numpy package to generate these arrays using x = numpy.loadtxt('file.txt', dtype = 'str', delimiter = '\t') which gives me:

array([['140', 'GGC'],
       ['256', 'AGGG'],
       ['841', 'CA'],
       ['46', 'TTATAGG'],
       ['64', 'AGAGAAAGGATTATG'],
       ['156', 'AGC'],
       ['187', 'GGA'],
       ['701', 'TTCG'],
       ['700', 'TC']], 
      dtype='|S15')

1) I know I can convert the first column to integers using:

x[:,0].astype(int)

which outputs a 1-D array. But I'm not sure how to make changes directly to my 2-D array?

2) Once I can convert (or treat) my first column to integers, I know I can sort using:

sorted(x, key=lambda x: x[0])

But is this the best way to do so for my data type?

Upvotes: 3

Views: 3247

Answers (3)

Divakar

Reputation: 221664

Since you are working with array data, you can get the sort indices based off the first column using np.argsort and then simply index into the array with those, like so -

x[x[:,0].astype(int).argsort()]

From a performance point of view, this should be much faster than sorted() with a lambda, because argsort and the subsequent indexing are both vectorized operations that work efficiently on array data.

Sample run -

In [56]: x
Out[56]: 
array([['140', 'GGC'],
       ['256', 'AGGG'],
       ['841', 'CA'],
       ['46', 'TTATAGG'],
       ['64', 'AGAGAAAGGATTATG'],
       ['156', 'AGC'],
       ['187', 'GGA'],
       ['701', 'TTCG'],
       ['700', 'TC']], 
      dtype='|S15')

In [57]: x[x[:,0].astype(int).argsort()]
Out[57]: 
array([['46', 'TTATAGG'],
       ['64', 'AGAGAAAGGATTATG'],
       ['140', 'GGC'],
       ['156', 'AGC'],
       ['187', 'GGA'],
       ['256', 'AGGG'],
       ['700', 'TC'],
       ['701', 'TTCG'],
       ['841', 'CA']], 
      dtype='|S15')
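
For anyone who wants to try this outside an interactive session, here is a minimal self-contained sketch of the same idea. It assumes Python 3, where the array holds unicode strings rather than '|S15' bytes, and it uses only a subset of the rows. The names order, numeric_sorted and lexical_sorted are just illustrative. The second sort is there only to contrast numeric ordering with what you get when the column is compared as strings.

```python
import numpy as np

x = np.array([['140', 'GGC'],
              ['256', 'AGGG'],
              ['841', 'CA'],
              ['46', 'TTATAGG'],
              ['64', 'AGAGAAAGGATTATG']])

# Sort indices computed from the first column interpreted as integers.
order = x[:, 0].astype(int).argsort()
numeric_sorted = x[order]

# For contrast: argsort on the raw string column orders lexicographically.
lexical_sorted = x[x[:, 0].argsort()]

print(numeric_sorted[:, 0])
print(lexical_sorted[:, 0])
```

Note that x itself is left untouched and still contains strings; only the sort key is converted.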

Upvotes: 2

craidz

Reputation: 141

You can use the built-in sort functions in numpy:

import numpy as np

dtype = [('id', int), ('seq', '|S15')]
x = np.array([('140', 'GGC'),
              ('256', 'AGGG'),
              ('841', 'CA'),
              ('46', 'TTATAGG'),
              ('64', 'AGAGAAAGGATTATG'),
              ('156', 'AGC'),
              ('187', 'GGA'),
              ('701', 'TTCG'),
              ('700', 'TC')],
             dtype=dtype)

x_copy = np.sort(x, order='id') # quicksort
x_copy = np.sort(x, order='id', kind='mergesort') # stable sort
x.sort(order='id') # in-place quicksort

Specify the data type of the columns of your array at initialization so you don't have to create a view later on, then run the sort. You can do this by specifying dtype= when you first load the data from the text file:

dtype = [('id', int), ('seq', '|S15')]
x = np.loadtxt('file.txt', dtype=dtype, delimiter='\t')
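
As a rough sketch of how that loading step could look end to end, the snippet below reads tab-separated text into a structured array and sorts it in place. Two things here are assumptions rather than part of the question: io.StringIO stands in for 'file.txt' so the example runs without a file on disk, and the sequence field uses 'U15' (unicode) instead of '|S15', which tends to be more convenient under Python 3.

```python
import io
import numpy as np

# Stand-in for 'file.txt': tab-separated id and sequence columns.
text = "140\tGGC\n256\tAGGG\n46\tTTATAGG\n64\tAGAGAAAGGATTATG\n"

dtype = [('id', int), ('seq', 'U15')]
x = np.loadtxt(io.StringIO(text), dtype=dtype, delimiter='\t')

# x is a 1-D array of (id, seq) records; sort it in place by the integer id.
x.sort(order='id')

ids = x['id'].tolist()
seqs = x['seq'].tolist()
print(ids)
print(seqs)
```

The result is a 1-D array of records rather than a 2-D array, so columns are accessed by field name (x['id'], x['seq']) instead of x[:, 0].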

np.sort() returns a sorted copy of the array, which costs extra memory and time with larger datasets. x.sort() sorts the array in place.

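You can also specify the algorithm used. Generally, quicksort is the fastest; if you need a stable sort, use kind='mergesort' (newer NumPy versions also accept kind='stable'). In a stable sort, items with equal keys keep their original relative order: if [(1, 'GGC'), (1, 'GGA'), ...] is sorted by id alone, GGC stays before GGA. Note that with order='id' on a structured array, NumPy still uses the remaining fields to break ties, so to preserve the original order of equal ids you need to sort on the key column itself, e.g. x[x['id'].argsort(kind='mergesort')].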

Although quicksort has a quadratic worst case, whereas mergesort always runs in O(n log n) time, quicksort is usually faster in practice.
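
To make the stability point concrete, here is a small sketch with duplicate ids. It sorts through argsort on the id field alone, because the NumPy documentation for the order argument says that unspecified fields are still used to break ties. The sample records and the 'U15' field type are made up for illustration, and kind='stable' assumes a reasonably recent NumPy (kind='mergesort' should behave the same way).

```python
import numpy as np

dtype = [('id', int), ('seq', 'U15')]
x = np.array([(1, 'GGC'),
              (1, 'GGA'),
              (0, 'TT')],
             dtype=dtype)

# Stable argsort on the id field only: equal ids keep their input order.
idx = x['id'].argsort(kind='stable')
stable_sorted = x[idx]

print(stable_sorted['id'].tolist())
print(stable_sorted['seq'].tolist())
```

Here 'GGC' stays ahead of 'GGA' because only the id values are compared and the sort is stable.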

Upvotes: 0

Somil

Reputation: 1941

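The simplest way to sort this array is to convert the key to an integer inside the lambda. Note that sorted() returns a list of rows, not a 2-D array: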

sorted(x, key=lambda x: int(x[0]))
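
A short sketch of how this might be used in practice, assuming Python 3 and a small made-up subset of the data. Since sorted() gives back a plain list of 1-D row arrays, the result is wrapped in np.array() to get a 2-D array again. The names rows and result are only illustrative.

```python
import numpy as np

x = np.array([['140', 'GGC'],
              ['256', 'AGGG'],
              ['46', 'TTATAGG']])

# sorted() iterates over the rows and compares them by the integer key.
rows = sorted(x, key=lambda row: int(row[0]))

# rows is a Python list of 1-D arrays; rebuild a 2-D array from it.
result = np.array(rows)

print(type(rows))
print(result)
```

This is fine for small inputs, but it loops over the rows in Python, so it will likely be slower than an argsort-based approach on large arrays.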

Upvotes: 0
