Reputation: 331
from statistics import mean
import pandas as pd
df = pd.DataFrame(columns=['A', 'B', 'C'])
df["A"] = [1, 2, 3, 4, 4, 5, 6]
df["B"] = ["Feb", "Feb", "Feb", "May", "May", "May", "May"]
df["C"] = [10, 20, 30, 40, 30, 50, 60]
df1 = df.groupby(["A","B"]).agg(mean_err=("C", mean)).reset_index()
df1["threshold"] = df1["A"] * df1["mean_err"]
Instead of the last line of code, how can I do this the way PySpark's .withColumn() works? Assigning to df1["threshold"] breaks the method chain. I would like to create the new column from the output of the operation on the fly, as part of the chain, similar to PySpark's withColumn method.
Does anybody have an idea how to do this?
Upvotes: 6
Views: 2125
Reputation: 71707
DataFrame.eval
(df.groupby(['A', 'B'], as_index=False)
.agg(mean_err=('C', 'mean'))
.eval('threshold = A * mean_err'))
DataFrame.assign
(df.groupby(['A', 'B'], as_index=False)
.agg(mean_err=('C', 'mean'))
.assign(threshold=lambda x: x['A'] * x['mean_err']))
A B mean_err threshold
0 1 Feb 10.0 10.0
1 2 Feb 20.0 40.0
2 3 Feb 30.0 90.0
3 4 May 35.0 140.0
4 5 May 50.0 250.0
5 6 May 60.0 360.0
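If you want something that reads even closer to PySpark's withColumn, you can combine DataFrame.assign with DataFrame.pipe. Here is a minimal sketch; the helper name with_column is hypothetical, not a pandas API:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "A": [1, 2, 3, 4, 4, 5, 6],
    "B": ["Feb", "Feb", "Feb", "May", "May", "May", "May"],
    "C": [10, 20, 30, 40, 30, 50, 60],
})

def with_column(frame, name, func):
    # Thin withColumn-style wrapper (hypothetical helper) over assign;
    # each call returns a new DataFrame, so calls chain like in PySpark.
    return frame.assign(**{name: func})

result = (df.groupby(["A", "B"], as_index=False)
            .agg(mean_err=("C", "mean"))
            .pipe(with_column, "threshold", lambda x: x["A"] * x["mean_err"]))
print(result)
```

pipe just passes the DataFrame as the first argument to the helper, so you can keep stacking .pipe(with_column, ...) calls without intermediate variables. Under the hood it is still assign, so there is no functional difference from the answer above, only the chained spelling.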
Upvotes: 7