aam

Reputation: 23

Calculate aggregated variance for each group in Python

I have a data frame (df) with these columns: user, vector, and group.

import pandas as pd

df = pd.DataFrame({
    'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'],
    'vector': [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
    'group': ['A', 'B', 'C', 'B', 'A', 'A'],
})

I want to calculate aggregated variance for each group.

I tried this code, but it returns an error:

aggregated_variance = (df.groupby('group', as_index=False)['vector'].agg(["var"]))

ValueError: no results

Upvotes: 1

Views: 886

Answers (2)

Cameron Riddell

Reputation: 13457

You can use .explode to flatten each list into one row per element and then perform a .groupby operation on the resulting column:

out = (
    df.explode('vector')
    .groupby('group')['vector'].var(ddof=1)
)

print(out)
group
A    7.060606
B    7.428571
C    8.000000
Name: vector, dtype: float64

The trick here lies in the use of .explode, which gives every list element its own row while repeating the index and the other columns:

>>> df.head()
     user        vector group
0  user_1  [1, 0, 2, 0]     A
1  user_2  [1, 8, 0, 2]     B
2  user_3  [6, 2, 0, 0]     C
3  user_4  [5, 0, 2, 2]     B
4  user_5  [3, 8, 0, 0]     A

>>> df.explode('vector').head()
     user vector group
0  user_1      1     A
0  user_1      0     A
0  user_1      2     A
0  user_1      0     A
1  user_2      1     B
...
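
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. One assumption on my part: .explode leaves the column with object dtype, so I cast it to float before aggregating. The names long_df and variance_by_group are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["user_1", "user_2", "user_3", "user_4", "user_5", "user_6"],
    "vector": [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0],
               [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
    "group": ["A", "B", "C", "B", "A", "A"],
})

# One row per list element; the exploded column is still object dtype,
# so cast it to float explicitly before computing statistics.
long_df = df.explode("vector").astype({"vector": float})

# Sample variance (ddof=1) over all vector elements in each group.
variance_by_group = long_df.groupby("group")["vector"].var(ddof=1)
print(variance_by_group)
```

The explicit cast is a defensive choice rather than something the answer requires; it simply makes the dtype of the aggregated column unambiguous.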

Upvotes: 3

user3901917

Reputation: 167

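If you take the sum() after you group df, the lists in each group are concatenated, so you get a single list of all vector values per group. Then apply a lambda function to calculate the variance of each list. Note that np.var computes the population variance (ddof=0) by default; pass ddof=1 if you want the sample variance that pandas' .var() returns.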

import numpy as np

aggregated = df.groupby("group")["vector"].sum()  # summing lists concatenates them
aggregated_variance = aggregated.apply(lambda x: np.var(x)).reset_index()  # ddof=0 by default
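
To see how much the ddof setting matters here, the sketch below pools the values per group with a plain dictionary (my own substitution, so it does not rely on summing lists inside groupby) and computes np.var both ways. The names values_by_group and comparison are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": ["user_1", "user_2", "user_3", "user_4", "user_5", "user_6"],
    "vector": [[1, 0, 2, 0], [1, 8, 0, 2], [6, 2, 0, 0],
               [5, 0, 2, 2], [3, 8, 0, 0], [6, 0, 0, 2]],
    "group": ["A", "B", "C", "B", "A", "A"],
})

# Pool every vector element into one flat list per group.
values_by_group = {}
for group, vector in zip(df["group"], df["vector"]):
    values_by_group.setdefault(group, []).extend(vector)

# np.var divides by n by default (ddof=0); ddof=1 divides by n - 1.
comparison = pd.DataFrame({
    "np_var_ddof0": {g: np.var(v) for g, v in values_by_group.items()},
    "np_var_ddof1": {g: np.var(v, ddof=1) for g, v in values_by_group.items()},
})
print(comparison)
```

For group C, for example, the four values 6, 2, 0, 0 give 6.0 with ddof=0 and 8.0 with ddof=1, so the two conventions are not interchangeable on groups this small.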

Upvotes: 3
