numpy: efficiently obtain a statistic over array elements grouped by the elements of another array

Question

Apologies in advance for the potentially misleading title. I could not think of a way to word the problem properly without an illustrative example.

I have some data array (e.g.):

x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])

and a corresponding array of equal length which indicates which elements of x are grouped:

y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])

In this example, there are two groupings in x: [2,2,2,3,3,3,4,4,4] where y=0; and [1,1,2,2,3,3] where y=1. I want to obtain a statistic on all elements of x where y is 0, then where y is 1. I would like this to extend to large arrays with many groupings. y is always sorted from lowest to highest AND contains every integer between its min and max, with none missing. For example, y could be np.array([0,0,**1**,2,2,2,2,3,3,3]) for some x array of the same length, but not y = np.array([0,0,**2**,2,2,2,2,3,3,3]), as this has no ones.

I can do this by brute force quite easily for this example.

import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])

y_max = np.max(y)
stat_min = np.zeros(y_max+1)
stat_sum = np.zeros(y_max+1)

for i in np.arange(y_max+1):
    stat_min[i] = np.min(x[y==i])
    stat_sum[i] = np.sum(x[y==i])

print(stat_min)
print(stat_sum)

This gives [2. 1.] and [27. 12.] for the minimum and sum statistics of each grouping, respectively. I need a way to make this efficient when there are many groupings and the arrays are very large (> 1 million elements).

EDIT

A bit better with a list comprehension.

import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])

y_max = np.max(y)

stat_min = np.array([np.min(x[y==i]) for i in range(y_max+1)])
stat_sum = np.array([np.sum(x[y==i]) for i in range(y_max+1)])

print(stat_min)
print(stat_sum)

cadolphs · Accepted Answer

You could put your arrays into a DataFrame, then use groupby and its aggregation methods: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

import pandas as pd

df = pd.DataFrame({'x': x, 'y': y})

mins = df.groupby('y').min()
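
If you would rather stay in pure numpy, here is a minimal sketch that is not part of the answer above. It relies on the question's guarantee that y is sorted, so each group is one contiguous run of x, and it swaps in np.ufunc.reduceat for the groupby. The name starts is just an illustrative variable, not anything from the original code.

```python
import numpy as np

x = np.array([2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 2, 2, 3, 3])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Assumption: y is sorted, so every group is a contiguous slice of x.
# A group starts at index 0 and at every index where y differs from
# the previous element.
starts = np.flatnonzero(np.concatenate(([True], y[1:] != y[:-1])))

# reduceat applies the ufunc to x[starts[i]:starts[i+1]] for each i,
# with the last slice running to the end of x.
stat_min = np.minimum.reduceat(x, starts)
stat_sum = np.add.reduceat(x, starts)

print(stat_min)  # [2 1]
print(stat_sum)  # [27 12]
```

This avoids building a boolean mask per group, which is what makes the loop and list-comprehension versions slow when there are many groups. Note that the results keep the integer dtype of x, unlike the np.zeros-based loop, which produces floats.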
