RSG

Reputation: 77

Spark dataframe replace values of specific columns in a row with Nulls

I am facing a problem when trying to replace the values of specific columns of a Spark dataframe with nulls. I have a dataframe with more than fifty columns, two of which are key columns. I want to create a new dataframe with the same schema, where the new dataframe keeps the values of the key columns and has nulls in the non-key columns. I tried the following, but ran into issues:

//old_df is the existing Dataframe 
val key_cols = List("id", "key_number")
val non_key_cols = old_df.columns.toList.filterNot(key_cols.contains(_))

val key_col_df = old_df.select(key_cols.head, key_cols.tail:_*)
val non_key_cols_df = old_df.select(non_key_cols.head, non_key_cols.tail:_*)
val list_cols = List.fill(non_key_cols_df.columns.size)("NULL")
val rdd_list_cols = spark.sparkContext.parallelize(Seq(list_cols)).map(l => Row(l:_*))
val list_df = spark.createDataFrame(rdd_list_cols, non_key_cols_df.schema)

val new_df = key_col_df.crossJoin(list_df)

This approach worked when old_df contained only string-type columns. But some columns are of double and int type, and those throw an error because the RDD holds the string "NULL" rather than actual null values.

To avoid this I tried making list_df an empty dataframe with the schema of non_key_cols_df, but then the result of crossJoin is an empty dataframe, which I believe is because one side of the join has no rows.

My requirement is to build a single-row dataframe of nulls with the non_key_cols schema, so that I can crossJoin it with key_col_df to form the required new_df.

Alternatively, any easier way to set all columns except the key columns of a dataframe to null would also resolve my issue. Thanks in advance.
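[Edit] For reference, the single-row dataframe of typed nulls described above can be built by putting real null values (instead of the string "NULL") into the Row and reusing the existing schema, so the types survive. A sketch, assuming spark is an active SparkSession and non_key_cols_df/key_col_df are as defined above:

// Build one Row of real nulls matching non_key_cols_df's schema
import org.apache.spark.sql.Row

val null_row = Row.fromSeq(Seq.fill(non_key_cols_df.schema.size)(null))
val list_df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(null_row)),
  non_key_cols_df.schema) // column types come from the original schema
val new_df = key_col_df.crossJoin(list_df)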

Upvotes: 1

Views: 4297

Answers (2)

pasha701

Reputation: 7207

Shaido's answer has a small drawback: the column types are lost (every nulled column becomes NullType). This can be fixed by using the schema, like this:

import org.apache.spark.sql.functions.lit

val nonKeyCols = df.schema.fields.filterNot(f => keyCols.contains(f.name))
// cast(c.dataType) preserves each column's original type
val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c.name, lit(null).cast(c.dataType)))
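A self-contained sketch showing that the types survive, assuming a local SparkSession (the data here is just illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val keyCols = List("id", "key_number")
val df = Seq((1, 2, 3, 4.0)).toDF("id", "key_number", "c", "d")

val nonKeyCols = df.schema.fields.filterNot(f => keyCols.contains(f.name))
// fold over the non-key fields, nulling each one but keeping its type
val df2 = nonKeyCols.foldLeft(df)((acc, c) =>
  acc.withColumn(c.name, lit(null).cast(c.dataType)))

df2.printSchema() // c is still IntegerType, d is still DoubleType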

Upvotes: 0

Shaido

Reputation: 28422

crossJoin is an expensive operation, so you want to avoid it if possible. An easier solution is to iterate over all non-key columns and set each one to null with lit(null). Using foldLeft, this can be done as follows:

import org.apache.spark.sql.functions.lit

val keyCols = List("id", "key_number")
val nonKeyCols = df.columns.filterNot(keyCols.contains(_))

val df2 = nonKeyCols.foldLeft(df)((df, c) => df.withColumn(c, lit(null)))

Input example:

+---+----------+---+----+
| id|key_number|  c|   d|
+---+----------+---+----+
|  1|         2|  3| 4.0|
|  5|         6|  7| 8.0|
|  9|        10| 11|12.0|
+---+----------+---+----+

will give:

+---+----------+----+----+
| id|key_number|   c|   d|
+---+----------+----+----+
|  1|         2|null|null|
|  5|         6|null|null|
|  9|        10|null|null|
+---+----------+----+----+

Upvotes: 2
