Reputation: 451
I need to conditionally update a column in a dataframe based on the values present in one of its columns.
For example, based on the values in col9, I want a new column col10 to hold the values A/B/C (let's say based on the criteria below):
0.00-0.50 : A
0.51-0.75 : B
0.75-1.00 : C
Expected Output:
   col1..col8      col9 col10
0    0.788310  0.211690     A
1    0.293871  0.706129     B
2    0.002207  0.997793     C
3    0.047834  0.952166     C
Can this be done in a performance-efficient manner?
Upvotes: 1
Views: 1058
Reputation: 13001
While there is a very good answer for pandas dataframes, since you mentioned pyspark in the tags I assume you mean Spark dataframes.
If so, you can do something like this:
from pyspark.sql.functions import when, lit
newDF = df.withColumn("col10", when(df["col9"] < 0.5, lit("A")).otherwise(when(df["col9"] > 0.75, lit("C")).otherwise(lit("B"))))
I assumed the legal values for the column are 0-1, but if you need to check them explicitly you can simply change the conditions and add an additional when/otherwise for illegal values.
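For example, a minimal sketch of how that extra check could look (the "INVALID" label and the [0, 1] range check are my illustration, not part of the original answer):
from pyspark.sql.functions import when, lit

# Flag values outside [0, 1] first, then apply the A/B/C buckets
newDF = df.withColumn(
    "col10",
    when((df["col9"] < 0) | (df["col9"] > 1), lit("INVALID"))
    .otherwise(when(df["col9"] < 0.5, lit("A"))
               .otherwise(when(df["col9"] > 0.75, lit("C")).otherwise(lit("B"))))
)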
Upvotes: 1
Reputation: 191
This is a perfect situation for a User-Defined Function (UDF). If you need more flexibility (for instance, creating more than one column from your input), you can look at transformers.
Your UDF would look something like the following:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def colToString(num):
    if num >= 0 and num < 0.5: return 'A'
    elif num >= 0.5 and num < 0.75: return 'B'
    elif num >= 0.75 and num < 1.0: return 'C'
    else: return 'D'

myUdf = udf(colToString, StringType())
df = df.withColumn("col10", myUdf('col9'))
Here, myUdf takes a parameter which is a double and returns a string. The double value is read from the input column, col9.
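For completeness, a quick usage sketch (assuming an active SparkSession; the sample values are only illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.211690,), (0.706129,), (0.997793,)], ["col9"])
df = df.withColumn("col10", myUdf("col9"))
df.show()  # col10 now holds A, B or C according to the thresholds above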
The key is to use dataframe operations to perform this, not Pandas. Pandas will not perform your operations in a distributed manner, while Spark will.
Upvotes: 1
Reputation: 215117
You can use pd.cut() and label the categories the way you wanted:
import pandas as pd
df['col10'] = pd.cut(df['col9'], [0, 0.5, 0.75, 1], labels = list("ABC"))
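One caveat worth noting (my addition, not stated in the original answer): with the default right=True, the bins are (0, 0.5], (0.5, 0.75] and (0.75, 1], so an exact 0 in col9 would become NaN; passing include_lowest=True closes the first bin. A minimal sketch:
import pandas as pd

df = pd.DataFrame({"col9": [0.0, 0.211690, 0.706129, 0.997793]})
# include_lowest=True makes the first interval [0, 0.5] so an exact 0 maps to 'A'
df['col10'] = pd.cut(df['col9'], [0, 0.5, 0.75, 1],
                     labels=list("ABC"), include_lowest=True)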
Upvotes: 4