Reputation: 177
I have a two-dimensional array that I'm trying to sort by the first column. However, currently every element in the array is a string, and I'd like the first column to be treated as an integer so I can sort numerically (1, 2, 6, 11... instead of 1, 11, 224, 23,...). I'm using the numpy package to generate these arrays using x = numpy.loadtxt('file.txt', dtype = 'str', delimiter = '\t')
which gives me:
array([['140', 'GGC'],
['256', 'AGGG'],
['841', 'CA'],
['46', 'TTATAGG'],
['64', 'AGAGAAAGGATTATG'],
['156', 'AGC'],
['187', 'GGA'],
['701', 'TTCG'],
['700', 'TC']],
dtype='|S15')
1) I know I can convert the first column to integers using:
x[:,0].astype(int)
which outputs a 1-D array. But I'm not sure how to make changes directly to my 2-D array?
2) Once I can convert (or treat) my first column to integers, I know I can sort using:
sorted(x, key=lambda x: x[0])
But is this the best way to do so for my data type?
Upvotes: 3
Views: 3247
Reputation: 221664
Since you are working with array data, you can get the sort indices based on the first column using np.argsort and then simply index into the array with those, like so:
x[x[:,0].astype(int).argsort()]
From a performance point of view, this should be much better than sorted with a lambda, since argsort and the subsequent indexing are both vectorized operations that work very efficiently on array data.
Sample run -
In [56]: x
Out[56]:
array([['140', 'GGC'],
['256', 'AGGG'],
['841', 'CA'],
['46', 'TTATAGG'],
['64', 'AGAGAAAGGATTATG'],
['156', 'AGC'],
['187', 'GGA'],
['701', 'TTCG'],
['700', 'TC']],
dtype='|S15')
In [57]: x[x[:,0].astype(int).argsort()]
Out[57]:
array([['46', 'TTATAGG'],
['64', 'AGAGAAAGGATTATG'],
['140', 'GGC'],
['156', 'AGC'],
['187', 'GGA'],
['256', 'AGGG'],
['700', 'TC'],
['701', 'TTCG'],
['841', 'CA']],
dtype='|S15')
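For anyone who wants to try this outside IPython, here is a minimal self-contained sketch of the same idea. It assumes unicode strings (NumPy's default for str input under Python 3) rather than the '|S15' byte strings shown above, and it only uses the first five rows of the sample data. The names lexicographic, order and numeric are just illustrative.

```python
import numpy as np

# Subset of the question's data; unicode strings are assumed here,
# whereas the original array held byte strings ('|S15').
x = np.array([['140', 'GGC'],
              ['256', 'AGGG'],
              ['841', 'CA'],
              ['46', 'TTATAGG'],
              ['64', 'AGAGAAAGGATTATG']])

# Sorting on the raw string column gives lexicographic order.
lexicographic = x[x[:, 0].argsort()]

# Converting the key column to int first gives numeric order.
order = x[:, 0].astype(int).argsort()
numeric = x[order]

print(lexicographic[:, 0].tolist())  # ['140', '256', '46', '64', '841']
print(numeric[:, 0].tolist())        # ['46', '64', '140', '256', '841']
```

Note that the result still holds strings in both columns; only the sort key is converted.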
Upvotes: 2
Reputation: 141
You can use the built-in sort functions in numpy:
import numpy as np
dtype = [('id', int), ('seq', '|S15')]
x = np.array([('140', 'GGC'),
('256', 'AGGG'),
('841', 'CA'),
('46', 'TTATAGG'),
('64', 'AGAGAAAGGATTATG'),
('156', 'AGC'),
('187', 'GGA'),
('701', 'TTCG'),
('700', 'TC')],
dtype=dtype)
x_copy = np.sort(x, order='id') # quicksort
x_copy = np.sort(x, order='id', kind='mergesort') # stable sort
x.sort(order='id') # in-place quicksort
Specify the data type of the columns of your array at initialization so you don't have to create a view later on, then run the sort. You can do this by specifying dtype= when you first load the data from the text file:
dtype = [('id', int), ('seq', '|S15')]
x = numpy.loadtxt('file.txt', dtype=dtype, delimiter = '\t')
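A rough way to check this without a real file is to feed loadtxt an in-memory buffer. This is only a sketch: fake_file and its three rows are made up for illustration, and 'U15' (unicode) is assumed in place of '|S15' so the sequences come back as str under Python 3.

```python
import io
import numpy as np

# Hypothetical stand-in for 'file.txt': two tab-separated columns.
fake_file = io.StringIO("140\tGGC\n46\tTTATAGG\n64\tAGAGAAAGGATTATG\n")

# 'U15' is an assumption made for this sketch; the answer uses '|S15'.
dtype = [('id', int), ('seq', 'U15')]
x = np.loadtxt(fake_file, dtype=dtype, delimiter='\t')

# The 'id' field is already an integer, so the sort is numeric.
x.sort(order='id')

print(x['id'].tolist())   # [46, 64, 140]
print(x['seq'].tolist())  # ['TTATAGG', 'AGAGAAAGGATTATG', 'GGC']
```

The result is a 1-D structured array (one record per line), not a 2-D array, so columns are accessed by field name.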
np.sort() returns a sorted copy of the array, which costs extra memory and time with larger datasets. x.sort() sorts the array in place.
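The difference between the two calls can be seen in a small sketch like the one below. The three records are a subset of the sample data, 'U15' is assumed for the sequence field, and x_sorted and result are just illustrative names.

```python
import numpy as np

dtype = [('id', int), ('seq', 'U15')]
x = np.array([(140, 'GGC'), (46, 'TTATAGG'), (64, 'AGAGAAAGGATTATG')],
             dtype=dtype)

# np.sort returns a new array and leaves x untouched.
x_sorted = np.sort(x, order='id')
print(x['id'].tolist())         # [140, 46, 64]
print(x_sorted['id'].tolist())  # [46, 64, 140]

# ndarray.sort reorders x itself and returns None.
result = x.sort(order='id')
print(result is None)           # True
print(x['id'].tolist())         # [46, 64, 140]
```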
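You can also specify the algorithm used. Generally, quicksort is the fastest; however, if you need a stable sort, use mergesort (newer NumPy versions also accept kind='stable'). In a stable sort, records whose keys compare equal stay in the order they had before sorting. Note that when a structured array is sorted with order='id', NumPy uses the remaining fields to break ties, so [(1, 'GGC'), (1, 'GGA'), ...] comes out with GGA before GGC; to keep the input order for equal ids (GGC before GGA), do a stable argsort on the id field alone, for example x[np.argsort(x['id'], kind='mergesort')].
Although quicksort runs in quadratic time in the worst case, as compared to mergesort (which runs in linear-log time), quicksort is usually faster in practice.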
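To illustrate what stability means for tied keys, here is a small sketch with made-up records (the ids and sequences are hypothetical, and 'U15' is assumed for the sequence field). It also shows that np.sort with order='id' on a structured array falls back on the other fields to break ties, so for that call the tie order comes from 'seq' rather than from the input order. kind='stable' assumes a reasonably recent NumPy; kind='mergesort' is the older spelling.

```python
import numpy as np

dtype = [('id', int), ('seq', 'U15')]
x = np.array([(2, 'TTC'), (1, 'GGC'), (2, 'AAA'), (1, 'GGA')], dtype=dtype)

# Stable argsort on the key field alone: rows with equal ids keep
# the order they had in the input.
order = np.argsort(x['id'], kind='stable')
by_id_stable = x[order]
print(by_id_stable['seq'].tolist())  # ['GGC', 'GGA', 'TTC', 'AAA']

# Sorting the structured array with order='id': the unspecified
# field 'seq' is used to break ties between equal ids.
by_fields = np.sort(x, order='id', kind='stable')
print(by_fields['seq'].tolist())     # ['GGA', 'GGC', 'AAA', 'TTC']
```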
Upvotes: 0
Reputation: 1941
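One straightforward way to sort this array is to convert the first column to int in the key function; note that sorted returns a list of rows, not a NumPy array: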
sorted(x, key=lambda x: int(x[0]))
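If you go this route and still want an array at the end, you can wrap the result in np.array again. A minimal sketch, assuming unicode strings and a three-row subset of the sample data (rows and x_sorted are illustrative names):

```python
import numpy as np

x = np.array([['140', 'GGC'],
              ['46', 'TTATAGG'],
              ['64', 'AGAGAAAGGATTATG']])

# sorted() iterates over the rows and returns a plain Python list
# of 1-D arrays, ordered by the integer value of the first column.
rows = sorted(x, key=lambda row: int(row[0]))
print(type(rows).__name__)      # list

# Rebuild a 2-D array from the sorted rows.
x_sorted = np.array(rows)
print(x_sorted[:, 0].tolist())  # ['46', '64', '140']
```

This runs the key function once per row in Python, so it is likely to be slower than an argsort-based approach on large arrays.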
Upvotes: 0