Spark Dataframe Select Columns After Transformation

Question

I am checking for NULL values in 2 of the 6 columns in my DF. But when I apply the built-in functions and use select, the resulting DF does not have the remaining columns. Is there a better way to do this without using UDFs?

handle_null_cols = [ 'col1', 'col3' ]

# df_null = df.select([ myFunc(col_name).alias(col_name) for col_name in df.columns ])
df_null = df.select( [ myFunc(col_name).alias(col_name) for col_name in handle_null_cols ])

df_null.printSchema() # Resultant DF has only 2 columns selected

col1:int
col3:int

I need to reuse the same DF, df_null, for further transformations downstream, with all the columns originally in df.

Alper t. Turker · Accepted Answer

Why not do something like this?

df.select([
    myFunc(col_name).alias(col_name) if col_name in handle_null_cols
    else col_name
    for col_name in df.columns
])
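The comprehension above emits one entry per original column, so the result keeps every column in its original order and only the targeted ones are transformed. Here is a minimal sketch of that selection logic, using plain strings in place of Spark columns so it runs without a Spark session. The names `build_select_list` and `tag_nulls_handled` are hypothetical, and `tag_nulls_handled` only stands in for the question's `myFunc`.

```python
def build_select_list(all_columns, target_columns, transform):
    """Return one select entry per column, transforming only the targets."""
    targets = set(target_columns)  # set gives fast membership checks
    return [
        transform(name) if name in targets else name
        for name in all_columns
    ]


def tag_nulls_handled(name):
    # Hypothetical stand-in for myFunc: real code would return a Spark Column
    # aliased back to the original name, e.g. myFunc(name).alias(name).
    return f"handled({name})"


columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
handle_null_cols = ["col1", "col3"]

select_list = build_select_list(columns, handle_null_cols, tag_nulls_handled)
print(select_list)
# ['handled(col1)', 'col2', 'handled(col3)', 'col4', 'col5', 'col6']
```

With a real DataFrame the same list would presumably be passed straight to select, for example df.select(build_select_list(df.columns, handle_null_cols, lambda c: myFunc(c).alias(c))).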

reduce + withColumn is a more cryptic but viable solution:

from functools import reduce

reduce(
    lambda df, col_name: df.withColumn(col_name, myFunc(col_name)), 
    handle_null_cols,
    df)
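If the fold reads as too cryptic, it may help to see that it is equivalent to an explicit loop that reassigns the DataFrame once per column. The sketch below shows both forms side by side. To stay runnable without Spark it uses `ToyFrame`, a hypothetical stand-in that mimics only withColumn (returning a new object and replacing an existing column in place), and `my_func`, a hypothetical stand-in for `myFunc`.

```python
from functools import reduce


class ToyFrame:
    """Hypothetical stand-in for a DataFrame; only mimics withColumn."""

    def __init__(self, columns):
        self.columns = dict(columns)  # column name -> expression string

    def withColumn(self, name, expr):
        # Return a new object; an existing name keeps its position.
        updated = dict(self.columns)
        updated[name] = expr
        return ToyFrame(updated)


def my_func(name):
    # Hypothetical stand-in for the question's myFunc.
    return f"handled({name})"


df = ToyFrame({f"col{i}": f"col{i}" for i in range(1, 7)})
handle_null_cols = ["col1", "col3"]

# Fold form, as in the answer: the accumulator is the evolving frame.
folded = reduce(
    lambda acc, name: acc.withColumn(name, my_func(name)),
    handle_null_cols,
    df,
)

# Equivalent explicit loop.
looped = df
for name in handle_null_cols:
    looped = looped.withColumn(name, my_func(name))

print(folded.columns == looped.columns)  # True
print(list(folded.columns))  # all six column names, original order
```

Either form leaves the untouched columns in place, which is what the question asks for; the loop is simply the same computation written out step by step.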

But it sounds a bit like what you actually want is the na functions:

df.na.fill(0, subset=handle_null_cols)
