Reputation: 782
I have a pandas DataFrame like the one below:

import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})
Now I want to perform a groupby operation on columns A and B, and get the element-wise average of the lists in column C: the first position of the result holds the average of all first elements, the second position the average of all second elements, and so on.
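For example, in the (1, apple) group the element-wise average of [6, 5, 2] and [2, 10, 2] is [(6+2)/2, (5+10)/2, (2+2)/2] = [4, 7.5, 2], which is what numpy computes along axis=0:

import numpy as np

# element-wise average of the two lists in the (1, "apple") group
np.mean([[6, 5, 2], [2, 10, 2]], axis=0)  # -> array([4. , 7.5, 2. ])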
The desired output for this example looks like this:

A  B          C
1  apple      [4, 7.5, 2]
1  banana     [5, 37, 1]
2  pineapple  [2.5, 12, 1.5]
(It is always guaranteed that the lists for each group have the same length)
How can I solve this?
I generally know how to perform groupby operations, either as list aggregations or as averages, but I could not find out how to do this for element-wise averages of lists. If a groupby operation is not the most efficient solution, I'm also open to other suggestions.
Upvotes: 2
Views: 196
Reputation: 71689
Here, we create a new dataframe from the lists contained in column C and set the index of this newly created dataframe to columns A and B. Then we aggregate this frame by taking the mean on the levels present in the index. Finally, using .values + tolist, we take a view of the mean values as a numpy array, convert this view to a list of lists, and assign it to column C:
s = df.set_index(['A', 'B'])
# one numbered column per list position, indexed by (A, B)
out = pd.DataFrame(list(s['C']), s.index).groupby(level=[0, 1]).mean()
# collapse the positional means back into a single list column C
out = out.drop(columns=out.columns).assign(C=out.values.tolist()).reset_index()
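For reference, this is roughly what the intermediate frame built from the lists looks like on the sample data (columns 0, 1, 2 correspond to the list positions):

pd.DataFrame(list(s['C']), s.index)

#               0   1  2
# A B
# 1 apple      6   5  2
# 1 apple      2  10  2
# 1 banana     5  37  1
# 2 pineapple  4  19  2
# 2 pineapple  1   5  1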
A more naive approach, which can be slower when dealing with big dataframes: group the dataframe by columns A and B and apply a lambda function on column C; the lambda builds a numpy array from the lists of each group and takes the mean along axis=0:
import numpy as np

out = df.groupby(['A', 'B'])['C'].apply(
    lambda s: np.array(list(s)).mean(axis=0)).reset_index()
   A          B                 C
0  1      apple   [4.0, 7.5, 2.0]
1  1     banana  [5.0, 37.0, 1.0]
2  2  pineapple  [2.5, 12.0, 1.5]
Timings on a sample dataframe with 50,000 rows and 30,000 unique groups:

df = pd.concat([df.assign(B=df['B'] + str(i))
                for i in range(10000)], ignore_index=True)
%%timeit
s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), s.index).groupby(level=[0, 1]).mean()
_ = out.drop(columns=out.columns).assign(C=out.values.tolist()).reset_index()
# 173 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
_ = df.groupby(['A', 'B'])['C'].apply(lambda s: np.array(list(s)).mean(axis=0)).reset_index()
# 2.24 s ± 68.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 4
Reputation: 825
Try this:

import numpy as np

df = df.groupby(['A', 'B'])['C'].agg(list).reset_index()
df['C'] = df['C'].apply(lambda x: np.mean(x, axis=0))
Output:

   A          B                 C
0  1      apple   [4.0, 7.5, 2.0]
1  1     banana  [5.0, 37.0, 1.0]
2  2  pineapple  [2.5, 12.0, 1.5]
Upvotes: 2
Reputation: 14949
Try:

df = pd.concat([df.pop('C').apply(pd.Series), df], axis=1).groupby(
    ['A', 'B']).mean().apply(list, axis=1).rename('C').reset_index()
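The df.pop('C').apply(pd.Series) step expands each list into its own numbered column, so the grouped mean is taken per list position. A rough sketch of that intermediate (.apply(pd.Series) is convenient but relatively slow; pd.DataFrame(df['C'].tolist(), index=df.index) is a commonly used faster equivalent):

df['C'].apply(pd.Series)

#    0   1  2
# 0  6   5  2
# 1  2  10  2
# 2  5  37  1
# 3  4  19  2
# 4  1   5  1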
or:
df = df.T.apply(pd.Series.explode).T.convert_dtypes().groupby(
    ['A', 'B']).mean().apply(list, axis=1).rename('C').reset_index()
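In this second variant, the transpose/explode round trip leaves a frame with three duplicated C columns, one per list position, so the grouped mean is again computed element-wise; roughly:

df.T.apply(pd.Series.explode).T

#    A          B  C   C  C
# 0  1      apple  6   5  2
# 1  1      apple  2  10  2
# 2  1     banana  5  37  1
# 3  2  pineapple  4  19  2
# 4  2  pineapple  1   5  1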
Upvotes: 2