Evgenii
Evgenii

Reputation: 449

group and collect set if columns are not empty after join

I have a table and 3 dictionaries. The dictionaries have only 1 column each. The table's structure is

time:string, ctn:string, url:string

after joining I get a df with 6 columns

tableDF
.join(enq2, $"url".contains($"q2"), "left_outer")
.join(enq3, $"url".contains($"q3"), "left_outer")
.join(model, $"url".contains($"model"), "left_outer")

What I need is 3 columns from the table and, in case q2, q3, and model are not null, a set of model. Or in case either of q2, q3 or model is null an empty set.

so I group

.groupBy($"url", $"ctn", $"timestamp")

and try to get the set

.withColumn("model",
when($"q2".isNotNull && $"q3".isNotNull && $"model".isNotNull, collect_set($"model")
).otherwise())

which doesn't work

I also tried

.agg(get_set($"q2", $"q3", $"model").alias("model"))

where get_set looks somewhat like

def get_set(q2: ColumnName, q3: ColumnName, model: ColumnName):Column={
if(q2.isNotNull && q3.isNotNull && model.isNotNull)
  collect_set(model)
}

but isNotNull returns Column and I need boolean for this. I'm not sure if ==null is right. Or maybe I should convert column to string and check it.

Any advice?

Upvotes: 1

Views: 1245

Answers (1)

Ramesh Maharjan
Ramesh Maharjan

Reputation: 41957

You should be doing the following

.groupBy($"url", $"ctn", $"timestamp").
agg(
  collect_set(
    when($"q2".isNotNull && $"q3".isNotNull && $"model".isNotNull, $"model").
    otherwise(lit(null))
  )
)

or you can just drop the nulled rows

df_joined.na.drop().groupBy($"url", $"ctn", $"timestamp").
agg(collect_set($"model"))

Upvotes: 1

Related Questions