sten
sten

Reputation: 7486

numpy group by, returning original indexes sorted by the result

i have array like this :

array([[2, 1],
       [3, 5],
       [2, 1],
       [4, 2],
       [2, 3],
       [5, 3]])

What i want to do is 'group-by' sum by first column and then sort by second column :

array([[2, 5],
       [3, 5],
       [5, 3],
       [4, 2]])

and here comes the twist, I want to also get back the indexes from the original array of every row in the result array, sorted :

     2       3     5    4
 [[0,2,4],  [1],  [5], [3] ]

OR if its easy .. i need to get the top N indexes ... let say top 2 :

     2       3    
  [0,2,4,    1]

No pandas, only pure numpy.

BTW, i need only the top N items and their indexes .. this may simplify speedup the process


trying to apply any of this:

https://izziswift.com/is-there-any-numpy-group-by-function

Upvotes: 4

Views: 1094

Answers (3)

Carlos Pinzón
Carlos Pinzón

Reputation: 1427

You may find useful a simple general purpose group_by function that I posted for this other question: https://stackoverflow.com/a/77150915/3671939

You can use it to solve this problem as follows:

x = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])
keys, idxs, sums = group_by(x[:, 0], lambda idx: [x[idx[0], 0], idx, x[idx, 1].sum()]).T

N = 2
top = np.argsort(-sums)[:N]
print(keys[top])  # [2 3]
print([idxs[i] for i in top])  # [array[0, 2, 4], array[1]]

Note that the function is fast, but does not exploit that "only the N-top" are requested. This could be done by using np.argpartition and then sort for the smaller array, although it may not produce any significant benefit as the group_by part has O(n log n) complexity plus many constants:

top = np.argpartition(sums, -N)[-N:]
top = top[np.argsort(-sums[top])]

Upvotes: 1

Michael Szczesny
Michael Szczesny

Reputation: 5026

I'm not happy with this solution and can't verify that it will not break with other data. It's using the referenced idea to group, but sums with add.reduceat.

a = np.array(
      [[2, 1],
       [3, 5],
       [2, 1],
       [4, 2],
       [2, 3],
       [5, 3]])

s = a[:,0].argsort()
b = a[s]
groups, index = np.unique(b[:,0], return_index=True)
# splits = np.split(b[:,1], index[1:]) # if you need the groups
groupsum = np.stack([groups, np.add.reduceat(b[:,1], index)]).T
groupsum[(groupsum[:,1]*(-1)).argsort()]

Output

array([[2, 5],
       [3, 5],
       [5, 3],
       [4, 2]])

To get the indices for each group

np.stack([groups.astype(object),np.split(np.arange(len(a))[s], index[1:])]).T

Output

array([[2, array([0, 2, 4])],
       [3, array([1])],
       [4, array([3])],
       [5, array([5])]], dtype=object)

Upvotes: 1

Jérôme Richard
Jérôme Richard

Reputation: 50816

There is sadly no group-by in Numpy, but you can use np.unique to find the unique elements and their index which is enough to implement what you need. One the keys as been identified, you can perform a key-based reduction using np.add.at. For the sort by value, you can use np.argsort. See this post and this one for more information.

keys, index = np.unique(df[:,0], return_inverse=True) # Find the unique key to group
values = np.zeros(len(keys), dtype=np.int64)          # Sum-based accumulator
np.add.at(values, index, df[:,1])                     # Key-based accumulation
tmp = np.hstack([keys[:,None], values[:,None]])       # Build the key-sum 2D array
res = tmp[tmp[:, 1].argsort()[::-1]]                  # Sort by value

Note that the index can be easily obtained from the index variable (which is a reversed index). There is no way to build it with Numpy but this is possible using a simple python loop accumulating the index i in lists stored in a dictionary for in each key keys[index[i]]. Here is an example:

from collections import defaultdict
d = defaultdict(list)
for i in range(len(df)): d[keys[index[i]]].append(i)

Upvotes: 3

Related Questions