Reputation: 69
I am checking for NULL values in 2 of the 6 columns in my DF. But when I apply the built-in functions inside select, the resulting DF does not contain the remaining columns. Is there a better way to do this without using UDFs?
handle_null_cols = [ 'col1', 'col3' ]
# df_null = df.select([ myFunc(col_name).alias(col_name) for col_name in df.columns ])
df_null = df.select( [ myFunc(col_name).alias(col_name) for col_name in handle_null_cols ])
df_null.printSchema() # Resultant DF has only 2 columns selected
col1:int
col3:int
Need to reuse the same DF df_null to do some more transformations downstream with all the columns originally in df.
Upvotes: 1
Views: 410
Reputation: 69
I think I figured it out based on @user9613318's insights. It is easier on the eye. Is it efficient performance-wise as well?
handle_null_cols = [ 'col1', 'col3' ]
df_null = df.select(*[
    myFunc(col).alias(col) if col in handle_null_cols else col
    for col in df.columns
])
Upvotes: 0
Reputation: 35249
Why not do something like this?
df.select([
    myFunc(col_name).alias(col_name) if col_name in handle_null_cols
    else col_name
    for col_name in df.columns
])
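For reference, here is a minimal sketch of that pattern pulled out into a helper. The question never shows myFunc, so my_func below is only a stand-in (coalesce to 0), and select_exprs and demo are hypothetical names, not anything from PySpark itself. The pyspark import sits inside demo so the helper can be tried without a Spark installation.

```python
from typing import Callable, Iterable, List


def select_exprs(all_cols: Iterable[str],
                 target_cols: Iterable[str],
                 transform: Callable[[str], object]) -> List[object]:
    """Build one select expression per column, keeping the original order.

    Columns listed in target_cols go through transform; all others are
    passed through unchanged as plain column names.
    """
    targets = set(target_cols)
    return [transform(c) if c in targets else c for c in all_cols]


def demo() -> None:
    # Requires pyspark; imported here so the helper above works without it.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [(1, "a", None), (None, "b", 3)],
        "col1 int, col2 string, col3 int",
    )

    # Assumption: stand-in for the question's myFunc, replacing NULL with 0.
    def my_func(c):
        return F.coalesce(F.col(c), F.lit(0)).alias(c)

    # select accepts a list mixing Column objects and column-name strings.
    df_null = df.select(select_exprs(df.columns, ["col1", "col3"], my_func))
    df_null.printSchema()  # all three columns are still present
    spark.stop()
```

Calling demo() on a machine with Spark available should print a schema that still contains col2, which is the part the original select over handle_null_cols was dropping.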
reduce + withColumn is a more cryptic but viable solution:
from functools import reduce
reduce(
    lambda df, col_name: df.withColumn(col_name, myFunc(col_name)),
    handle_null_cols,
    df)
But it sounds a bit like you actually want the na functions:
df.na.fill(0, subset=handle_null_cols)
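A rough sketch of the na route, assuming the goal really is replacing NULLs with a constant. Note that fill with a numeric value only touches columns in subset whose type matches, and that fill also accepts a dict mapping column name to replacement value (in which case subset is ignored). fill_map and demo are hypothetical helper names used only for illustration.

```python
from typing import Any, Dict, Iterable


def fill_map(cols: Iterable[str], value: Any) -> Dict[str, Any]:
    """Map each column name to its replacement value, for the dict form of fill."""
    return {c: value for c in cols}


def demo() -> None:
    # Requires pyspark; imported here so fill_map works without it.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [(1, "a", None), (None, "b", 3)],
        "col1 int, col2 string, col3 int",
    )
    handle_null_cols = ["col1", "col3"]

    # subset form: only the listed columns are filled, the rest stay as they are.
    filled = df.na.fill(0, subset=handle_null_cols)

    # dict form: same effect here, but allows a different value per column.
    filled_by_dict = df.na.fill(fill_map(handle_null_cols, 0))

    filled.printSchema()
    filled_by_dict.show()
    spark.stop()
```

Either form returns a DataFrame with every original column, so df_null can be reused downstream without rebuilding the column list.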
Upvotes: 2