Reputation: 704
I have parent dataframe, lets say having 10 ids, and subset of that dataframe has 4 ids, how to identify remaining 6 ids which are not there in subset dataframe. I don't wanted to run any foreach loop over the input as the input may contain millions of records.
Parent DataFrame :
id
1
2
3
4
5
SubsetDataFrame
id
1
2
3
Required Output is
id
4
5
Upvotes: 0
Views: 83
Reputation: 2967
I guess the simplest way is to use LEFT SEMI JOIN
case class D(id: Int, value: String)
val df1 = sqlContext.createDataFrame(Seq(D(1, "a"), D(2, "b"), D(3, "c")))
val df2 = sqlContext.createDataFrame(Seq(D(1, "a")))
df1.registerTempTable("table1")
df2.registerTempTable("table2")
sqlContext.sql("select * from table1 t1 left semi join table2 t2 on (t1.id != t2.id)").show()
+---+-----+
| id|value|
+---+-----+
| 2| b|
| 3| c|
+---+-----+
Upvotes: 0
Reputation: 13927
This is what you want to do.
val parentDf = Seq(Tuple1(1), Tuple1(2), Tuple1(3), Tuple1(4), Tuple1(5)).toDF("id")
val subsetDf = Seq(Tuple1(1), Tuple1(2), Tuple1(3)).toDF("id")
val antiJoined = parentDf.join(
subsetDf.withColumnRenamed("l_id"),
$"id" === $"l_id",
"left_outer"
).filter($"l_id".isNull).drop("l_id")
antiJoined.show
+---+
| id|
+---+
| 4|
| 5|
+---+
Note that because subsetDf
is a subset of parentDf
, you only need a left_outer
. If you want to find elements missing in either side of the join (i.e. both DataFrames
where subsets of a larger superset, and you wanted to find elements in the subsets not present in the other subset) then you would use full_outer
instead.
Upvotes: 1