Reputation: 51
I have a column in a DataFrame which is an array of structs. Some of the structs have all-null values, and I would like to filter those out. For example, with the following DataFrame:
+-----------+------------------------------------------+
|advertiser |brands                                    |
+-----------+------------------------------------------+
|Advertiser1|[{"id": "a", "type": "b", "name": "c"}]   |
|Advertiser2|[{"id": null, "type": null, "name": null}]|
+-----------+------------------------------------------+
I would like to filter out the structs whose values are all null, to get:
+-----------+------------------------------------------+
|advertiser |brands                                    |
+-----------+------------------------------------------+
|Advertiser1|[{"id": "a", "type": "b", "name": "c"}]   |
|Advertiser2|[]                                        |
+-----------+------------------------------------------+
I'm thinking it's something along these lines, if I can come up with a struct of null values to compare against:
.withColumn(
  "brands",
  when(
    col("brands").equalTo(*emptyStruct?*),
    null
  )
)
Upvotes: 0
Views: 2758
Reputation: 1220
You want to filter the array elements. Since Spark 3.0 there is a method in org.apache.spark.sql.functions
with the signature:
filter(column: Column, f: Column => Column): Column
which does this for you:
df.select(
  col("advertiser"),
  filter(col("brands"), b =>
    b.getField("id").isNotNull &&
    b.getField("type").isNotNull &&
    b.getField("name").isNotNull) as "brands")
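For reference, here is a self-contained sketch of that approach; the local SparkSession and the sample data are assumptions that mirror the question's schema:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, filter}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Sample schema mirroring the question; `type` needs backticks in Scala
case class Brand(id: String, `type`: String, name: String)

val df = Seq(
  ("Advertiser1", Seq(Brand("a", "b", "c"))),
  ("Advertiser2", Seq(Brand(null, null, null)))
).toDF("advertiser", "brands")

df.select(
  col("advertiser"),
  filter(col("brands"), b =>
    b.getField("id").isNotNull &&
    b.getField("type").isNotNull &&
    b.getField("name").isNotNull) as "brands"
).show(false)
// Advertiser1 keeps its struct; Advertiser2's brands becomes []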
Using Spark 1.6+, you can explode the array, filter out the all-null structs, and then group back by advertiser:
df.select(col("advertiser"), explode(col("brands")) as "b")
  .filter(
    col("b.id").isNotNull &&
    col("b.type").isNotNull &&
    col("b.name").isNotNull)
  .groupBy("advertiser")
  .agg(collect_list("b") as "brands")
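One caveat with the explode-and-regroup approach: advertisers whose brands are all null produce no rows after the filter, so they disappear from the grouped result instead of keeping an empty array. A hedged sketch of one way to retain them, by left-joining the regrouped result back to the distinct advertisers (the variable names are illustrative):

import org.apache.spark.sql.functions.{col, collect_list, explode}

val regrouped = df
  .select(col("advertiser"), explode(col("brands")) as "b")
  .filter(
    col("b.id").isNotNull &&
    col("b.type").isNotNull &&
    col("b.name").isNotNull)
  .groupBy("advertiser")
  .agg(collect_list("b") as "brands")

// Advertisers that were filtered out entirely come back with a null
// brands column; replace that null with an empty array if [] is required.
df.select("advertiser").distinct()
  .join(regrouped, Seq("advertiser"), "left_outer")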
Upvotes: 0
Reputation: 4199
You can try the to_json function: a brands array whose struct fields are all null serializes to [{}], which you can compare against.
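If I follow the idea, the check could look something like the sketch below; it assumes the array holds exactly one struct, since to_json drops null fields and renders such an array as [{}]:

import org.apache.spark.sql.functions.{array, col, to_json, when}

df.withColumn(
  "brands",
  when(to_json(col("brands")) === "[{}]",
    // empty array, cast to the column's original element type
    array().cast(df.schema("brands").dataType))
    .otherwise(col("brands")))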
Upvotes: 1