Vinod
Vinod

Reputation: 31

Spark dataframe after join null check for integer type columns

I am implemeting solutions for removing duplicate elements from two dataframes using left out join. After performing join condition, I have to check null columns for right table.

val llist = Seq(("bob", "2015-01-13", 4), ("alice", "2015-04-23",10))
val left = llist.toDF("name","date","duration")
val right = Seq(("alice", "2015-04-23",10),("bob", "2015-04-23", 23)).toDF("name","date","duration")


val df = left.join(right , left("name") === right("name") &&
  left("date") === right("date") &&
  left("duration").cast(StringType) === right("duration").cast(StringType)
  ,"left_outer").filter(right("duration").isNull)

But I am unable to filter out integer columns with null values. How can we do null check for integers after join ?

Upvotes: 2

Views: 5494

Answers (1)

Raphael Roth
Raphael Roth

Reputation: 27373

It's rather unclear what you want to achieve. The way you do it creates ambiguous column names. In addition, you reference the original (source) dataframe (right) in the filter condition, not the joined dataframe.

If you want to join them, you can do:

val df = left
      .join(right , Seq("name","date","duration"),"left_outer")

But will not result in any "null" columns because duplicated columns are removed.

+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+

Otherwise, you can try this:

val df = left.as('left)
  .join(right.as('right) ,
          $"left.name" === $"right.name"
      and $"left.date" === $"right.date"
      and $"left.duration" === $"right.duration"
    ,"left_outer"
  )
  .filter($"right.duration".isNull)

this will result in

+----+----------+--------+----+----+--------+
|name|      date|duration|name|date|duration|
+----+----------+--------+----+----+--------+
| bob|2015-01-13|       4|null|null|    null|
+----+----------+--------+----+----+--------+

EDIT:

If you just want to remove duplicates, you could to this:

val df = left.unionAll(right).distinct()

Upvotes: 1

Related Questions