Reputation: 5923
I'd like to take the average of one vector based on grouping information in another vector. The two vectors are the same length. I've created a minimal example below based on averaging predictions for each user. How do I do that in NumPy?
>>> pred
[ 0.99 0.23 0.11 0.64 0.45 0.55 0.76 0.72 0.97 ]
>>> users
['User2' 'User3' 'User2' 'User3' 'User0' 'User1' 'User4' 'User4' 'User4']
Upvotes: 1
Views: 2422
Reputation: 10759
A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a solution similar to the vectorized one proposed by Jaime, but with a cleaner interface and more tests:
import numpy_indexed as npi
npi.group_by(users).mean(pred)
Upvotes: 1
Reputation: 67427
If you want to stick to NumPy, the simplest is to use np.unique and np.bincount:
>>> import numpy as np
>>> pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])
>>> users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
... 'User4', 'User4', 'User4'])
>>> unq, idx, cnt = np.unique(users, return_inverse=True, return_counts=True)
>>> avg = np.bincount(idx, weights=pred) / cnt
>>> unq
array(['User0', 'User1', 'User2', 'User3', 'User4'],
dtype='|S5')
>>> avg
array([ 0.45 , 0.55 , 0.55 , 0.435 , 0.81666667])
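The same steps can be wrapped in a small reusable function. This is only a sketch: the name group_mean and its shape check are illustrative and not part of NumPy. It computes the counts with a second np.bincount call instead of return_counts, on the assumption that you may be on a NumPy older than 1.9, where return_counts is not available.

```python
import numpy as np

def group_mean(keys, values):
    """Return (unique_keys, mean_per_key) for two equal-length 1-D sequences.

    group_mean is a hypothetical helper name, not a NumPy function.
    """
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    if keys.shape != values.shape:
        raise ValueError("keys and values must have the same shape")
    # idx maps every element of keys to its position in the sorted unique keys
    unq, idx = np.unique(keys, return_inverse=True)
    # per-group sums of values, then per-group element counts
    sums = np.bincount(idx, weights=values)
    counts = np.bincount(idx)
    return unq, sums / counts

users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
         'User4', 'User4', 'User4']
pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97]
unq, avg = group_mean(users, pred)
```

With the data from the question, unq and avg should match the arrays shown above.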
Upvotes: 1
Reputation: 74182
A 'pure numpy' solution might use a combination of np.unique and np.bincount:
import numpy as np
pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97]
users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4',
'User4', 'User4']
# assign integer indices to each unique user name, and get the total
# number of occurrences for each name
unames, idx, counts = np.unique(users, return_inverse=True, return_counts=True)
# now sum the values of pred corresponding to each index value
sum_pred = np.bincount(idx, weights=pred)
# finally, divide by the number of occurrences for each user name
mean_pred = sum_pred / counts
print(unames)
# ['User0' 'User1' 'User2' 'User3' 'User4']
print(mean_pred)
# [ 0.45 0.55 0.55 0.435 0.81666667]
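To see why this works, it can help to look at the intermediate arrays. The snippet below is a small walk-through on the same data, not a different method: idx holds, for each entry of users, the position of that name in the sorted unames, and np.bincount then sums the pred values that share an index.

```python
import numpy as np

users = np.array(['User2', 'User3', 'User2', 'User3', 'User0', 'User1',
                  'User4', 'User4', 'User4'])
pred = np.array([0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97])

unames, idx, counts = np.unique(users, return_inverse=True, return_counts=True)

# idx[i] is the position of users[i] in unames, so unames[idx] rebuilds users
print(idx)     # [2 3 2 3 0 1 4 4 4]
print(counts)  # [1 1 2 2 3]

# bincount adds up the weights (pred) that fall into each index bucket
sum_pred = np.bincount(idx, weights=pred)
mean_pred = sum_pred / counts
```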
If you have pandas installed, DataFrames have some very nice methods for grouping and summarizing data:
import pandas as pd
df = pd.DataFrame({'name':users, 'pred':pred})
print(df.groupby('name').mean())
#            pred
# name
# User0  0.450000
# User1  0.550000
# User2  0.550000
# User3  0.435000
# User4  0.816667
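If you need the result back as plain NumPy arrays, one option is to select the column before aggregating and convert the resulting Series. This is a sketch that assumes a pandas version recent enough to provide to_numpy (0.24 or later); the variable name mean_by_user is just illustrative.

```python
import pandas as pd

users = ['User2', 'User3', 'User2', 'User3', 'User0', 'User1', 'User4',
         'User4', 'User4']
pred = [0.99, 0.23, 0.11, 0.64, 0.45, 0.55, 0.76, 0.72, 0.97]

df = pd.DataFrame({'name': users, 'pred': pred})

# selecting the 'pred' column first gives a Series indexed by user name
mean_by_user = df.groupby('name')['pred'].mean()

# group keys (sorted by default) and the matching means as NumPy arrays
unames = mean_by_user.index.to_numpy()
mean_pred = mean_by_user.to_numpy()
```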
Upvotes: 4