rurp

Reputation: 1446

Pandas new column from groupby averages

I have a DataFrame

>>> df = pd.DataFrame({'a':[1,1,1,2,2,2],
...                    'b':[10,20,20,10,20,20],
...                    'result':[100,200,300,400,500,600]})
... 
>>> df
   a   b  result
0  1  10     100
1  1  20     200
2  1  20     300
3  2  10     400
4  2  20     500
5  2  20     600

and want to create a new column holding the average of 'result' for each row's combination of 'a' and 'b'. I can get those values with a groupby:

>>> df.groupby(['a','b'])['result'].mean()
a  b 
1  10    100
   20    250
2  10    400
   20    550
Name: result, dtype: int64

but cannot figure out how to turn that into a new column in the original DataFrame. The final result should look like this:

>>> df
   a   b  result  avg_result
0  1  10     100         100
1  1  20     200         250
2  1  20     300         250
3  2  10     400         400
4  2  20     500         550
5  2  20     600         550

I could do this by looping through the combinations of 'a' and 'b' but that would get really slow and unwieldy for larger sets of data. There is probably a much simpler and faster way to go.

Upvotes: 18

Views: 10108

Answers (3)

Adli

Reputation: 1

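You need to reset the index of the grouped result, like:

df.groupby(['a', 'b'])['result'].mean().reset_index()

The group means then come back as a regular DataFrame with 'a' and 'b' as ordinary columns.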
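
Resetting the index alone does not attach the averages to the original rows; the group means still have to be joined back onto the DataFrame. A minimal sketch of that extra step, assuming a left merge on 'a' and 'b' (the merge and the names `means` and `merged` are illustrative and not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2],
                   'b': [10, 20, 20, 10, 20, 20],
                   'result': [100, 200, 300, 400, 500, 600]})

# Group means as a flat DataFrame with columns a, b, avg_result
# (one row per (a, b) combination).
means = (df.groupby(['a', 'b'])['result']
           .mean()
           .reset_index(name='avg_result'))

# Left-merge so every original row picks up the mean of its group.
merged = df.merge(means, on=['a', 'b'], how='left')
print(merged)
```

This works, but it builds an intermediate DataFrame and relies on the merge to line the rows up again, so it is usually more roundabout than a single groupby call.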

Upvotes: 0

Mithun Theertha

Reputation: 171

Since the previous answer (https://stackoverflow.com/a/33445035/6504287) is pandas-based, I'm adding a PySpark-based solution. In PySpark, a Window function partitioned by 'a' and 'b' does the job:

    from pyspark.sql import Window
    from pyspark.sql.functions import avg

    windowSpecAgg = Window.partitionBy('a', 'b')
    ext_data_df.withColumn('avg_result', avg('result').over(windowSpecAgg)).show()

The code above uses the example from the previous answer (https://stackoverflow.com/a/33445035/6504287); ext_data_df is assumed to be a Spark DataFrame holding that data.

Upvotes: 0

Alex Riley

Reputation: 176850

You need transform:

df['avg_result'] = df.groupby(['a', 'b'])['result'].transform('mean')

This generates a correctly indexed column of the groupby values for you:

   a   b  result  avg_result
0  1  10     100         100
1  1  20     200         250
2  1  20     300         250
3  2  10     400         400
4  2  20     500         550
5  2  20     600         550
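
To make the difference between aggregating and transforming concrete, here is a small self-contained sketch using the question's data. The variable names `group_means` and `row_means` are only illustrative. On recent pandas versions the new column will likely print as floats (100.0, 250.0, ...) rather than the integers shown above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2],
                   'b': [10, 20, 20, 10, 20, 20],
                   'result': [100, 200, 300, 400, 500, 600]})

grouped = df.groupby(['a', 'b'])['result']

# Aggregation: one value per (a, b) group, indexed by the group keys.
group_means = grouped.mean()

# Transform: one value per original row, indexed like df itself,
# so it can be assigned straight back as a new column.
row_means = grouped.transform('mean')

df['avg_result'] = row_means
print(len(group_means), len(row_means))
print(df)
```

Because the transformed Series shares the original index, the assignment aligns row by row without any merge or loop.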

Upvotes: 33
