GGG

Reputation: 71

Sort a string array by element lengths using NumPy

I want to sort a string array using numpy by the length of the elements.

>>> import numpy as np
>>> arr = ["year","month","eye","i","stream","key","house"]
>>> x = np.sort(arr, axis=-1, kind='mergesort')
>>> print(x)
['eye' 'house' 'i' 'key' 'month' 'stream' 'year']

But it sorts them in alphanumeric order. How can I sort them using numpy by their length?

Upvotes: 4

Views: 11717

Answers (2)

hpaulj

Reputation: 231605

If I expand your list to arr1=arr*1000, the Python list sort using len as the key function is fastest.

In [77]: len(arr1)
Out[77]: 7000

In [78]: timeit sarr=sorted(arr1,key=len)
100 loops, best of 3: 3.03 ms per loop

In [79]: %%timeit
arrA=np.array(arr1)
larr=[len(i) for i in arrA]  # list comprehension works same as map
sarr=arrA[np.argsort(larr)]
   ....: 
100 loops, best of 3: 7.77 ms per loop

Converting the list to an array takes about 1 ms (that conversion adds significant overhead, especially for small lists). Even with an already created array and np.char.str_len, the time is still slower than the Python sort.

In [83]: timeit sarr=arrA[np.argsort(np.char.str_len(arrA))]
100 loops, best of 3: 6.51 ms per loop

The np.char functions can be convenient, but they still basically iterate over the array, applying the corresponding str method to each element.

In general argsort gives you much of the same power as the key function.
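As a rough sketch of that idea (not a benchmarked recommendation): compute one key per element, argsort the keys, and index the array with the result. The helper name sort_by_key is made up for illustration, and kind="stable" is my assumption so that ties keep their original order, as sorted() does.

```python
import numpy as np

def sort_by_key(values, key):
    """Hypothetical helper: mimic sorted(values, key=key) for a 1-D array."""
    values = np.asarray(values)
    # One key per element; this still loops in Python, as noted above.
    keys = np.array([key(v) for v in values])
    # A stable sort keeps equal-key elements in their original order.
    order = np.argsort(keys, kind="stable")
    return values[order]

arr = np.array(["year", "month", "eye", "i", "stream", "key", "house"])
by_len = sort_by_key(arr, len)
print(by_len)
```

With a stable sort the ordering matches sorted(arr, key=len); without it, the order of equal-length strings is not guaranteed.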

Upvotes: 1

sascha

Reputation: 33532

Add a helper array containing the lengths of the strings, then use NumPy's argsort, which gives you the indices that would sort the array according to these lengths. Index the original data with these indices:

import numpy as np
arr = np.array(["year","month","eye","i","stream","key","house"])  # np-array needed for later indexing
arr_ = [len(s) for s in arr]  # a list works on both py2 and py3; a bare map() is a lazy iterator on py3
x = arr[np.argsort(arr_)]  # indices that order the lengths, applied to the original array
print(x)
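If the order among equal-length strings matters, one possible extension (an addition of mine, not part of the approach above) is np.lexsort, which sorts by several keys at once. The last key passed is the primary one, so the sketch below sorts by length first and alphabetically within each length; the variable names are illustrative.

```python
import numpy as np

arr = np.array(["year", "month", "eye", "i", "stream", "key", "house"])

# Length of each string, computed by NumPy's character routines.
lengths = np.char.str_len(arr)

# lexsort uses the LAST key as the primary key: length first, then the string itself.
order = np.lexsort((arr, lengths))
by_len_then_alpha = arr[order]
print(by_len_then_alpha)
```

Here "house" comes before "month" because both have length 5 and ties are broken alphabetically.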

Upvotes: 3
