Reputation: 397
Apologies in advance for the potentially misleading title. I could not think of a way to word the problem properly without an illustrative example.
I have some data array (e.g.):
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
and a corresponding array of equal length which indicates which elements of x are grouped:
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
In this example, there are two groupings in x: [2,2,2,3,3,3,4,4,4] where y=0, and [1,1,2,2,3,3] where y=1. I want to obtain a statistic on all elements of x where y is 0, then where y is 1. I would like this to be extendable to large arrays with many groupings. y is always ordered from lowest to highest and always increases sequentially, with no missing integers between its min and max. For example, y could be np.array([0,0,1,2,2,2,2,3,3,3]) for some x array of the same length, but not np.array([0,0,2,2,2,2,2,3,3,3]), as this has no ones.
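The constraint on y can be written as a quick check: consecutive elements may only stay the same or increase by exactly one. This is a minimal sketch; the helper name is_valid_grouping is hypothetical and only there for illustration.

```python
import numpy as np

def is_valid_grouping(y):
    """Return True if y is sorted and has no missing integers between min and max."""
    steps = np.diff(y)
    # Each step must be 0 (same group) or 1 (next group starts)
    return bool(np.all((steps == 0) | (steps == 1)))

good = np.array([0, 0, 1, 2, 2, 2, 2, 3, 3, 3])
bad = np.array([0, 0, 2, 2, 2, 2, 2, 3, 3, 3])  # skips 1

print(is_valid_grouping(good))
print(is_valid_grouping(bad))
```

The first call should report the array as valid and the second should not, since the second array jumps from 0 to 2.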
I can do this by brute force quite easily for this example.
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.zeros(y_max+1)
stat_sum = np.zeros(y_max+1)
for i in np.arange(y_max+1):
    # Boolean mask selects the elements of x belonging to group i
    stat_min[i] = np.min(x[y==i])
    stat_sum[i] = np.sum(x[y==i])
print(stat_min)
print(stat_sum)
This gives [2. 1.] and [27. 12.] for the minimum and sum statistics of each grouping, respectively. I need a way to make this efficient for a large number of groupings and for very large arrays (> 1 million elements).
EDIT
A bit better with a list comprehension.
import numpy as np
x = np.array([2,2,2,3,3,3,4,4,4,1,1,2,2,3,3])
y = np.array([0,0,0,0,0,0,0,0,0,1,1,1,1,1,1])
y_max = np.max(y)
stat_min = np.array([np.min(x[y==i]) for i in range(y_max+1)])
stat_sum = np.array([np.sum(x[y==i]) for i in range(y_max+1)])
print(stat_min)
print(stat_sum)
Upvotes: 0
Views: 31
Reputation: 9647
You could put your arrays into a DataFrame, then use groupby and its aggregation methods (min, sum, and so on): https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
import pandas as pd
df = pd.DataFrame({'x': x, 'y': y})
# One row per distinct value of y
mins = df.groupby('y').min()
sums = df.groupby('y').sum()
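If you would rather stay in NumPy, a possible alternative (not the groupby approach above) is to use the fact that y is sorted, so each group is one contiguous slice of x. The sketch below finds the start index of each slice and applies ufunc.reduceat; it assumes y is sorted as described in the question and would give wrong results otherwise. The variable name starts is just illustrative.

```python
import numpy as np

x = np.array([2, 2, 2, 3, 3, 3, 4, 4, 4, 1, 1, 2, 2, 3, 3])
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# Index of the first element of each group: position 0 plus every
# position where y changes value (valid only because y is sorted).
starts = np.concatenate(([0], np.flatnonzero(np.diff(y)) + 1))

# reduceat reduces each slice x[starts[i]:starts[i+1]] in one call
stat_min = np.minimum.reduceat(x, starts)
stat_sum = np.add.reduceat(x, starts)

print(stat_min)  # [2 1]
print(stat_sum)  # [27 12]
```

The results keep the integer dtype of x rather than being floats as in the loop version. This avoids building one boolean mask per group, which is likely where the loop spends most of its time when there are many groups.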
Upvotes: 1