Using pyspark, how to add a column to a DataFrame as a key-value map of multiple known columns in the same DataFrame excluding nulls?

Question

Given the following example:

d = [{'asset': '2', 'ts': 6,  'B':'123','C':'234'}, 
     {'asset': '1', 'ts': 5, 'C.1':'999', 'B':'888','F':'999'}]
df = spark.createDataFrame(d)
df.show(truncate=False)

+---+----+-----+---+----+----+
|B  |C   |asset|ts |C.1 |F   |
+---+----+-----+---+----+----+
|123|234 |2    |6  |null|null|
|888|null|1    |5  |999 |999 |
+---+----+-----+---+----+----+

I want to create the following output:

+-----+---+--------------------------------+
|asset|ts |signals                         |
+-----+---+--------------------------------+
|2    |6  |[B -> 123, C -> 234]            |
|1    |5  |[B -> 888, C.1 -> 999, F -> 999]|
+-----+---+--------------------------------+

I tried the following:

from itertools import chain
from pyspark.sql.functions import *
all_signals=['B','C','C.1','F']
key_values = create_map(*(chain(*[(lit(name), col("`"+name+"`"))
                                  for name in all_signals])))

new_df = df.withColumn('signals',key_values).drop(*all_signals).show(truncate=False)

+-----+---+--------------------------------------+
|asset|ts |signals                               |
+-----+---+--------------------------------------+
|2    |6  |[B -> 123, C -> 234, C.1 ->, F ->]    |
|1    |5  |[B -> 888, C ->, C.1 -> 999, F -> 999]|
+-----+---+--------------------------------------+

But I don't want the keys with null values. So I tried so many ways to exclude null or None. I tried "if" conditions, when/otherwise but none seem to work. Here is one attempt:

key_values = create_map(*(chain(*[(lit(name), col("`"+name+"`")) 
                                  for name in all_signals 
                                  if col("`"+name+"`").isNotNull()])))
new_df = df.withColumn('signals',key_values).drop(*all_signals).show(truncate=False)


ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I got it to work using a circular way which I am not happy with:

new_df= df.withColumn("signals", from_json(
                       to_json(struct(["`"+x+"`" for x in all_signals])),"MAP"))
                      
new_df = new_df.drop(*all_signals)
new_df.show(truncate=False)

+-----+---+--------------------------------+
|asset|ts |signals                         |
+-----+---+--------------------------------+
|2    |6  |[B -> 123, C -> 234]            |
|1    |5  |[B -> 888, C.1 -> 999, F -> 999]|
+-----+---+--------------------------------+

But there must be a way to exclude null without going to json and back!

murtihash · Accepted Answer

No need for UDF, Use higher order function filter, with arrays_zip and map_from_entries to get your desired output.(spark2.4+)

from pyspark.sql import functions as F

all_singals=['B','C','C.1','F']

df.withColumn("all_one", F.array(*[F.lit(x) for x in all_signals]))\
  .withColumn("all_two", F.array(*["`"+x+"`" for x in all_signals]))\
  .withColumn("signals", F.expr("""map_from_entries(filter(arrays_zip(all_one,all_two),x-> x.all_two is not null))"""))\
  .drop("all_one","all_two").show(truncate=False)

#+---+----+-----+---+----+----+--------------------------------+
#|B  |C   |asset|ts |C.1 |F   |signals                         |
#+---+----+-----+---+----+----+--------------------------------+
#|123|234 |2    |6  |null|null|[B -> 123, C -> 234]            |
#|888|null|1    |5  |999 |999 |[B -> 888, C.1 -> 999, F -> 999]|
#+---+----+-----+---+----+----+--------------------------------+

Using pyspark, how to add a column to a DataFrame as a key-value map of multiple known columns in the same DataFrame excluding nulls?

Answers (2)

Related Questions