Shankar Pandey

Reputation: 451

Conditional Update of column in DataFrame in python

I need to conditionally update a column in a DataFrame based on the values present in one of its columns.

For example, based on the values in COL9, I want a new column COL10 to contain the values A/B/C (let's say based on the criteria below):

0.00-0.50 : A
0.51-0.75 : B
0.75-1.00 : C

Expected Output:

      col1..col8      col9     col10
0      0.788310     0.211690      A
1      0.293871     0.706129      B
2      0.002207     0.997793      C
3      0.047834     0.952166      C

Can this be done in a performance-efficient manner?

Upvotes: 1

Views: 1058

Answers (3)

Assaf Mendelson

Reputation: 13001

While there is a very good answer for pandas DataFrames, since you mentioned pyspark in the tags, I assume you mean Spark DataFrames.

If so, you can do something like this:

from pyspark.sql.functions import when, lit

newDF = df.withColumn("col10", when(df["col9"] < 0.5, lit("A")).otherwise(when(df["col9"] > 0.75, lit("C")).otherwise(lit("B"))))

I assumed legal values for the column are 0-1, but if you need to check them explicitly you can simply change the conditions and add an additional when/otherwise for illegal values, as sketched below.
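For instance, a minimal sketch of that extra branch, assuming a hypothetical label "X" for out-of-range values:

from pyspark.sql.functions import when, lit

newDF = df.withColumn(
    "col10",
    when((df["col9"] < 0) | (df["col9"] > 1), lit("X"))  # hypothetical label for illegal values
    .otherwise(
        when(df["col9"] < 0.5, lit("A"))
        .otherwise(when(df["col9"] > 0.75, lit("C")).otherwise(lit("B")))
    )
)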

Upvotes: 1

Chris Beard

Reputation: 191

This is a perfect situation for a user-defined function (UDF). If you need more flexibility (for instance, creating more than one column from your input), you can look at transformers.

Your UDF would look something like the following:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Map a value in [0, 1) to one of the labels A/B/C (D for anything else)
def colToString(num):
    if num >= 0 and num < 0.5: return 'A'
    elif num >= 0.5 and num < 0.75: return 'B'
    elif num >= 0.75 and num < 1.0: return 'C'
    else: return 'D'

myUdf = udf(colToString, StringType())
df.withColumn("col10", myUdf('col9'))

Here, myUdf takes a parameter which is a double and returns a string. The double value is read from the input column, col9.
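Note that withColumn returns a new DataFrame rather than modifying df in place, so assign the result if you want to keep it (a minimal usage sketch):

df = df.withColumn("col10", myUdf(df['col9']))
df.select("col9", "col10").show()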

The key is to use dataframe operations to perform this, not Pandas. Pandas will not perform your operations in a distributed manner, while Spark will.

Upvotes: 1

akuiper

Reputation: 215117

You can use pd.cut() and label the categories the way you wanted:

import pandas as pd
# bin col9 into (0, 0.5], (0.5, 0.75], (0.75, 1] and label the bins A, B, C
df['col10'] = pd.cut(df['col9'], [0, 0.5, 0.75, 1], labels=list("ABC"))

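Applied to the col9 values from the question, this produces the expected labels (a minimal reproducible sketch):

import pandas as pd

df = pd.DataFrame({'col9': [0.211690, 0.706129, 0.997793, 0.952166]})
df['col10'] = pd.cut(df['col9'], [0, 0.5, 0.75, 1], labels=list("ABC"))
print(df)
#        col9 col10
# 0  0.211690     A
# 1  0.706129     B
# 2  0.997793     C
# 3  0.952166     C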

Upvotes: 4
