Peter
Peter

Reputation: 782

How to perform a groupby operation on a pandas Dataframe where the average over a list column is taken?

I have a pandas DataFrame like below:

df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
                   "B": ["apple", "apple", "banana", "pineapple", "pineapple"],
                   "C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})

Now I want to perform a groupby-operation on columns A and B, and get the average of the lists in column C. The average of multiple lists is defined as an element-wise average, so the average of all 1st elements in the 1st position of the list, the average of all 2nd elements in the second position of the list and so on...

The desired output for this example looks like this:

A        B            C
1        apple        [4, 7.5, 2]
1        banana       [5, 37, 1]
2        pineapple    [2.5, 12, 1.5]

(It is always guaranteed that the lists for each group have the same length)

How to solve this?

Usually I know how to perform groupby operations, either as list aggregations or as averages, but I could not find how to do this when comparing multiple lists. Should a groupby operation not be the most efficient solution, I'm also open to other suggestions.

Upvotes: 2

Views: 196

Answers (3)

Shubham Sharma
Shubham Sharma

Reputation: 71689

Approach 1

Here, we create a new dataframe from the lists contained in column C and set the index of this newly created dataframe to columns A and B. Now, aggregate this frame by taking mean on levels present in the index

Then using .values + tolist take the view of mean values as numpy array, convert this view to list and assign to the column C

s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), s.index).mean(level=[0, 1])
out.drop(out.columns.tolist(), 1).assign(C=out.values.tolist()).reset_index()

Approach 2

Naive approach which can be slower when dealing with big dataframes. Here we group the dataframe by columns A and B and apply a lambda function on column C, the lambda function then creates a numpy array from the lists and takes mean along axis=0

out = df.groupby(['A', 'B'])['C'].apply(
         lambda s: np.array(list(s)).mean(axis=0)).reset_index()

Result

   A          B                 C
0  1      apple   [4.0, 7.5, 2.0]
1  1     banana  [5.0, 37.0, 1.0]
2  2  pineapple  [2.5, 12.0, 1.5]

Performance Profiling

On sample dataframe with 50000 rows and 30000 unique groups

df = pd.concat([df.assign(B=df['B'] + str(i))
               for i in range(10000)], ignore_index=True)


%%timeit
s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), s.index).mean(level=[0, 1])
_ = out.drop(out.columns.tolist(), 1).assign(C=out.values.tolist()).reset_index()
# 173 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


%%timeit
_ = df.groupby(['A', 'B'])['C'].apply(lambda s: np.array(list(s)).mean(axis=0)).reset_index()
# 2.24 s ± 68.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 4

Pawan Jain
Pawan Jain

Reputation: 825

Try This

df = df.groupby(['A','B'])['C'].agg(list).reset_index()
df['C'] = df['C'].apply(lambda x: np.mean(x, axis=0))

Output

    A   B         C
0   1   apple     [4.0, 7.5, 2.0]
1   1   banana    [5.0, 37.0, 1.0]
2   2   pineapple [2.5, 12.0, 1.5]


Upvotes: 2

Nk03
Nk03

Reputation: 14949

TRY:

df = pd.concat([df.pop('C').apply(pd.Series), df], 1).groupby(
    ['A', 'B']).mean().apply(list, 1).reset_index()

or:

df = df.T.apply(pd.Series.explode).T.convert_dtypes().groupby(
    ['A', 'B']).mean().apply(list, 1).reset_index()

Upvotes: 2

Related Questions