Reputation: 7486
I have an array like this:
array([[2, 1],
[3, 5],
[2, 1],
[4, 2],
[2, 3],
[5, 3]])
What I want to do is a 'group-by' sum by the first column, then sort by the second column:
array([[2, 5],
[3, 5],
[5, 3],
[4, 2]])
And here comes the twist: I also want to get back the indices (from the original array) of every row in the result array, sorted:
2 3 5 4
[[0,2,4], [1], [5], [3] ]
Or, if it's easier, I only need the top N indices... say top 2:
2 3
[0,2,4, 1]
No pandas, only pure numpy.
BTW, I only need the top N items and their indices, which may simplify or speed up the process.
I tried applying the approaches from:
https://izziswift.com/is-there-any-numpy-group-by-function
Upvotes: 4
Views: 1094
Reputation: 1427
You may find useful a simple general-purpose group_by function that I posted for this other question: https://stackoverflow.com/a/77150915/3671939
You can use it to solve this problem as follows:
x = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])
keys, idxs, sums = group_by(x[:, 0], lambda idx: [x[idx[0], 0], idx, x[idx, 1].sum()]).T
N = 2
top = np.argsort(-sums)[:N]
print(keys[top]) # [2 3]
print([idxs[i] for i in top]) # [array([0, 2, 4]), array([1])]
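For completeness, here is a rough self-contained sketch of what such a group_by helper could look like. This is an assumption reconstructed from the call signature above, not the actual implementation from the linked answer:

```python
import numpy as np

def group_by(keys, func):
    # Sketch: sort by key, find group boundaries, apply func to each
    # group's index array, and stack the results as an object array.
    order = np.argsort(keys, kind='stable')
    _, starts = np.unique(keys[order], return_index=True)
    return np.array([func(idx) for idx in np.split(order, starts[1:])],
                    dtype=object)

x = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])
keys, idxs, sums = group_by(x[:, 0],
                            lambda idx: [x[idx[0], 0], idx, x[idx, 1].sum()]).T
top = np.argsort(-sums)[:2]
```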
Note that the function is fast, but it does not exploit the fact that only the top N groups are requested. That could be done with np.argpartition followed by a sort of the smaller array, although it may not yield any significant benefit, since the group_by part already has O(n log n) complexity plus large constants:
top = np.argpartition(sums, -N)[-N:]
top = top[np.argsort(-sums[top])]
Upvotes: 1
Reputation: 5026
I'm not happy with this solution and can't verify that it won't break with other data. It uses the referenced idea to group, but sums with np.add.reduceat.
a = np.array(
[[2, 1],
[3, 5],
[2, 1],
[4, 2],
[2, 3],
[5, 3]])
s = a[:,0].argsort()
b = a[s]
groups, index = np.unique(b[:,0], return_index=True)
# splits = np.split(b[:,1], index[1:]) # if you need the groups
groupsum = np.stack([groups, np.add.reduceat(b[:,1], index)]).T
groupsum[(groupsum[:,1]*(-1)).argsort()]
Output
array([[2, 5],
[3, 5],
[5, 3],
[4, 2]])
To get the indices for each group
np.stack([groups.astype(object),np.split(np.arange(len(a))[s], index[1:])]).T
Output
array([[2, array([0, 2, 4])],
[3, array([1])],
[4, array([3])],
[5, array([5])]], dtype=object)
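Building on the same arrays, a sketch of extracting only the top-N keys and their flattened original indices (the asker's second output format) could look like this; the variable names follow the snippet above:

```python
import numpy as np

a = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])
s = a[:, 0].argsort(kind='stable')
b = a[s]
groups, index = np.unique(b[:, 0], return_index=True)
sums = np.add.reduceat(b[:, 1], index)
splits = np.split(np.arange(len(a))[s], index[1:])

N = 2
order = (-sums).argsort(kind='stable')[:N]            # top-N groups by summed value
top_keys = groups[order]                              # keys of the top-N groups
top_idx = np.concatenate([splits[i] for i in order])  # their original row indices
```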
Upvotes: 1
Reputation: 50816
There is sadly no group-by in Numpy, but you can use np.unique to find the unique elements and their indices, which is enough to implement what you need. Once the keys have been identified, you can perform a key-based reduction using np.add.at. For the sort by value, you can use np.argsort. See this post and this one for more information.
keys, index = np.unique(df[:,0], return_inverse=True) # Find the unique key to group
values = np.zeros(len(keys), dtype=np.int64) # Sum-based accumulator
np.add.at(values, index, df[:,1]) # Key-based accumulation
tmp = np.hstack([keys[:,None], values[:,None]]) # Build the key-sum 2D array
res = tmp[tmp[:, 1].argsort()[::-1]] # Sort by value
Note that the per-group indices can easily be obtained from the index variable (which is an inverse index). There is no single Numpy function to build them directly, but it is possible with a simple Python loop that accumulates each index i into a list stored in a dictionary under the key keys[index[i]]. Here is an example:
from collections import defaultdict
d = defaultdict(list)
for i in range(len(df)): d[keys[index[i]]].append(i)
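Alternatively, the per-key index lists can also be built without a Python loop by sorting the inverse index. A sketch, reusing the df/keys/index variables from above:

```python
import numpy as np

df = np.array([[2, 1], [3, 5], [2, 1], [4, 2], [2, 3], [5, 3]])
keys, index = np.unique(df[:, 0], return_inverse=True)

order = np.argsort(index, kind='stable')      # row indices, grouped by key
bounds = np.cumsum(np.bincount(index))[:-1]   # boundaries between groups
idx_lists = np.split(order, bounds)           # one index array per key
```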
Upvotes: 3