Innovator-programmer

Reputation: 331

How to create a new column dynamically in pandas, like PySpark's withColumn

from statistics import mean
import pandas as pd
df = pd.DataFrame(columns=['A', 'B', 'C'])
df["A"] = [1, 2, 3, 4, 4, 5, 6]
df["B"] = ["Feb", "Feb", "Feb", "May", "May", "May", "May"]
df["C"] = [10, 20, 30, 40, 30, 50, 60]
df1 = df.groupby(["A","B"]).agg(mean_err=("C", mean)).reset_index()

df1["threshold"] = df1["A"] * df1["mean_err"]

Instead of the last line of code, how can I do this the way PySpark's .withColumn() does it?


This code won't work. I would like to create a new column from the output of an operation on the fly, similar to PySpark's withColumn method.

Does anyone have an idea how to do this?

Upvotes: 6

Views: 2125

Answers (1)

Shubham Sharma

Reputation: 71707

Option 1: DataFrame.eval

(df.groupby(['A', 'B'], as_index=False)
   .agg(mean_err=('C', 'mean'))
   .eval('threshold = A * mean_err'))

Option 2: DataFrame.assign

(df.groupby(['A', 'B'], as_index=False)
   .agg(mean_err=('C', 'mean'))
   .assign(threshold=lambda x: x['A'] * x['mean_err']))

   A    B  mean_err  threshold
0  1  Feb      10.0       10.0
1  2  Feb      20.0       40.0
2  3  Feb      30.0       90.0
3  4  May      35.0      140.0
4  5  May      50.0      250.0
5  6  May      60.0      360.0
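If you need several derived columns, .assign accepts multiple keyword arguments that are applied left to right, so a later column can reference an earlier one through the lambda's argument; this is the closest pandas analog to chaining several .withColumn calls. A minimal sketch using the question's data (the extra column name `flagged` and its 100 cutoff are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 4, 5, 6],
    "B": ["Feb", "Feb", "Feb", "May", "May", "May", "May"],
    "C": [10, 20, 30, 40, 30, 50, 60],
})

# Each keyword in .assign adds one column; later keywords can use
# columns created by earlier ones, much like chained .withColumn calls.
out = (df.groupby(["A", "B"], as_index=False)
         .agg(mean_err=("C", "mean"))
         .assign(threshold=lambda x: x["A"] * x["mean_err"],
                 flagged=lambda x: x["threshold"] > 100))  # hypothetical cutoff
print(out)
```

Note that .eval can do the same in one string by separating expressions with newlines, e.g. `df.eval("threshold = A * mean_err\nflagged = threshold > 100")`.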

Upvotes: 7
