Reputation: 449
I have a table and 3 dictionaries. The dictionaries have only 1 column each. The table's structure is
time:string, ctn:string, url:string
after joining I get a df with 6 columns
tableDF
.join(enq2, $"url".contains($"q2"), "left_outer")
.join(enq3, $"url".contains($"q3"), "left_outer")
.join(model, $"url".contains($"model"), "left_outer")
What I need is 3 columns from the table and, in case q2, q3, and model
are not null, a set of model
. Or in case either of q2, q3 or model
is null an empty set.
so I group
.groupBy($"url", $"ctn", $"timestamp")
and try to get the set
.withColumn("model",
when($"q2".isNotNull && $"q3".isNotNull && $"model".isNotNull, collect_set($"model")
).otherwise())
which doesn't work
I also tried
.agg(get_set($"q2", $"q3", $"model").alias("model"))
where get_set looks somewhat like
def get_set(q2: ColumnName, q3: ColumnName, model: ColumnName):Column={
if(q2.isNotNull && q3.isNotNull && model.isNotNull)
collect_set(model)
}
but isNotNull
returns Column
and I need boolean
for this. I'm not sure if ==null
is right. Or maybe I should convert column to string and check it.
Any advice?
Upvotes: 1
Views: 1245
Reputation: 41957
You should be doing the following
.groupBy($"url", $"ctn", $"timestamp").
agg(
collect_set(
when($"q2".isNotNull && $"q3".isNotNull && $"model".isNotNull, $"model").
otherwise(lit(null))
)
)
or you can just drop
the nulled rows
df_joined.na.drop().groupBy($"url", $"ctn", $"timestamp").
agg(collect_set($"model"))
Upvotes: 1