Meesam
Meesam

Reputation: 51

Filter out struct of null values from an array of structs in spark dataframe

I have a col in a dataframe which is an array of structs. There are some structs with all null values which I would like to filter out. For example with the following dataframe:

+—————+————————————————---------------------—+
|advertiser      |brands                     |
+—————+—————————————---------------------————+
Advertiser1      [{"id" : "a", "type" : "b", "name" : "c"}]
Advertiser2      [{"id" : null, "type" : null, "name" : null}]
+—————+————————————---------------------—————+

I would like to filter out the struct with the null values to get:

+—————+———————————————---------------------——+
|advertiser      |brands                     |
+—————+————————————---------------------—————+
Advertiser1      [{"id" : "a", "type" : "b", "name" : "c"}]
Advertiser2      []
+—————+—————————————---------------------————+

I'm thinking it's something along the lines of this if I can come up with a struct of null values:

.withColumn(
    "brands",
    when(
        col("brands").equalTo(*emptyStruct?*),
            null
    )
)

Upvotes: 0

Views: 2758

Answers (2)

Jarrod Baker
Jarrod Baker

Reputation: 1220

You want to filter the array elements. Since Spark 3.0 there is a method in org.apache.spark.sql.functions with the signature:

filter(column: Column, f: Column => Column): Column

which does this for you:

df.select(
  col("advertiser"),
  filter(col("brands"),
         b =>
           b.getField("id").isNotNull
             && b.getField("type").isNotNull
             && b.getField("name").isNotNull) as "brands"

Using Spark 1.6+ you can explode the array, filter the structs and then group by advertiser:

df.select(col("advertiser"), explode(col("brands")) as "b").
  filter(
    col("b.id").isNotNull &&
      col("b.type").isNotNull &&
      col("b.name").isNotNull)
  .groupBy("advertiser").agg(collect_list("b") as "brands")

Upvotes: 0

过过招
过过招

Reputation: 4199

You can try to use the to_json function, brands with all null values will returns [{}].

Upvotes: 1

Related Questions