Kalpesh
Kalpesh

Reputation: 704

identify records in parent dataframe which are not present in subset dataframe in java

I have parent dataframe, lets say having 10 ids, and subset of that dataframe has 4 ids, how to identify remaining 6 ids which are not there in subset dataframe. I don't wanted to run any foreach loop over the input as the input may contain millions of records.

Parent DataFrame :
id
 1
 2
 3
 4
 5

SubsetDataFrame
 id
  1
  2
  3

Required Output is

 id
  4
  5

Upvotes: 0

Views: 83

Answers (2)

Vitalii Kotliarenko
Vitalii Kotliarenko

Reputation: 2967

I guess the simplest way is to use LEFT SEMI JOIN

case class D(id: Int, value: String)
val df1 = sqlContext.createDataFrame(Seq(D(1, "a"), D(2, "b"), D(3, "c")))
val df2 = sqlContext.createDataFrame(Seq(D(1, "a")))
df1.registerTempTable("table1")
df2.registerTempTable("table2")
sqlContext.sql("select * from table1 t1 left semi join table2 t2 on (t1.id != t2.id)").show()
+---+-----+
| id|value|
+---+-----+
|  2|    b|
|  3|    c|
+---+-----+

Upvotes: 0

David Griffin
David Griffin

Reputation: 13927

This is what you want to do.

val parentDf = Seq(Tuple1(1), Tuple1(2), Tuple1(3), Tuple1(4), Tuple1(5)).toDF("id")
val subsetDf = Seq(Tuple1(1), Tuple1(2), Tuple1(3)).toDF("id")

val antiJoined = parentDf.join(
  subsetDf.withColumnRenamed("l_id"),
  $"id" === $"l_id", 
  "left_outer"
).filter($"l_id".isNull).drop("l_id")

antiJoined.show
+---+
| id|
+---+
|  4|
|  5|
+---+

Note that because subsetDf is a subset of parentDf, you only need a left_outer. If you want to find elements missing in either side of the join (i.e. both DataFrames where subsets of a larger superset, and you wanted to find elements in the subsets not present in the other subset) then you would use full_outer instead.

Upvotes: 1

Related Questions