GenDemo

Reputation: 761

PySpark: replacing null values on subsets of rows

I have a PySpark DataFrame with null values that I want to replace; however, the value to replace them with differs between groups.

My data looks like this (apologies, I don't have a way to paste it as text):

[image: a table with a 'group' column (values A and B) and several value columns containing nulls]

For group A I want to replace the null values with -999, while for group B I want to replace them with 0.
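A hypothetical stand-in for that data (column names borrowed from the answers below; the actual values in the image are not recoverable) could be built like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative data only: a 'group' column plus value columns with nulls
df = spark.createDataFrame(
    [('A', 1.0, None, 5.0),
     ('A', None, 2.0, None),
     ('B', None, 3.0, 6.0),
     ('B', 4.0, None, None)],
    ['group', 'Col1', 'Col2', 'Col3'],
)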

Currently, I split the data by group and then do df = df.fillna(-999) on each part.

Is there a more efficient way of doing it? In pseudo-code I was thinking of something along the lines of df = df.where(col('group') == A).fillna(lit(-999)).where(col('group') == B).fillna(lit(0)), but of course this doesn't work.

Upvotes: 3

Views: 584

Answers (2)

Ric S

Reputation: 9247

Another possible option is to use coalesce for each column with a "filler" column holding the replacement values:

import pyspark.sql.functions as F

# For each column: keep the original value if it is not null,
# otherwise fall back to the group-specific replacement
for c in ['Col1', 'Col2', 'Col3']:
    df = df.withColumn(c, F.coalesce(c, F.when(F.col('group') == 'A', -999)
                                         .when(F.col('group') == 'B', 0)))
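Note that if a row's group is neither A nor B, the inner when chain returns null, so the coalesce evaluates to null as well and the original null is left in place.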

Upvotes: 1

PieCot

Reputation: 3639

You can use when:

from pyspark.sql import functions as F

# Loop over all the columns you want to fill
for col in ('Col1', 'Col2', 'Col3'):
    # Conditions that decide which replacement value to use
    fill_a = F.col(col).isNull() & (F.col('Group') == 'A')
    fill_b = F.col(col).isNull() & (F.col('Group') == 'B')

    # Fill the column based on the different conditions
    # using nested `when` - `otherwise`.
    #
    # Do not forget the final `otherwise`, which keeps the
    # original values when none of the previous conditions match
    filled_col = (
        F.when(fill_a, -999)
        .otherwise(
            F.when(fill_b, 0)
            .otherwise(F.col(col))
        )
    )

    # 'overwrite' the original column with the filled column
    df = df.withColumn(col, filled_col)
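
Equivalently, the nested structure can be flattened, since multiple when clauses can be chained before the final otherwise (the first matching condition wins). Inside the loop, the filled_col assignment could instead be written as:

    filled_col = (
        F.when(fill_a, -999)
        .when(fill_b, 0)
        .otherwise(F.col(col))
    )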

Upvotes: 2
