Reputation: 1647
I have the following scenario:
I have two DataFrames, each containing only one column. Let's say:
DF1=(1,2,3,4,5)
DF2=(3,6,7,8,9,10)
These values are keys, and I want to create a Parquet file from DF1 only if none of the keys in DF1 are in DF2 (in the current example the check should return false). My current way of achieving this is:
val df1count= DF1.count
val df2count=DF2.count
val diffDF=DF2.except(DF1)
val diffCount=diffDF.count
if(diffCount==(df2count-df1count)) true
else false
The problem with this approach is that I am calling multiple actions (every count triggers a separate job), which is surely not the best way. Can someone suggest a more efficient way of achieving this?
Upvotes: 1
Views: 12318
Reputation: 89
Try an intersection combined with a count. This checks that the contents are the same and that the number of values in both is the same, in which case the check evaluates to true:
val intersectcount= DF1.intersect(DF2).count()
val check =(intersectcount == DF1.count()) && (intersectcount==DF2.count())
Upvotes: 0
Reputation: 1973
Here is a check for whether Dataset first is equal to Dataset second:
// true when first and second contain the same distinct rows (first == second), false otherwise (first != second)
val isEqual = first.except(second).union(second.except(first)).count() == 0
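One caveat worth noting: except has set semantics (it behaves like SQL's EXCEPT DISTINCT), so the check above ignores duplicate rows. If duplicates matter, a variant along the following lines might work. This is only a sketch, assuming Spark 2.4 or later, where exceptAll is available; the object name DatasetEquality and the helper names sameDistinctRows and sameRows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DatasetEquality {
  // Hypothetical helper: the set-based check from the answer (duplicates are ignored).
  def sameDistinctRows(first: DataFrame, second: DataFrame): Boolean =
    first.except(second).union(second.except(first)).count() == 0

  // Hypothetical helper: multiset comparison, so duplicate rows count. Requires Spark 2.4+.
  def sameRows(first: DataFrame, second: DataFrame): Boolean =
    first.exceptAll(second).take(1).isEmpty && second.exceptAll(first).take(1).isEmpty

  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("dataset-equality").getOrCreate()
    import spark.implicits._

    val first = Seq(1, 2, 2).toDF("key")
    val second = Seq(1, 2).toDF("key")

    println(s"same distinct rows: ${sameDistinctRows(first, second)}")
    println(s"same rows (with duplicates): ${sameRows(first, second)}")

    spark.stop()
  }
}
```

Both helpers ignore row order. Which one is appropriate depends on whether the keys are guaranteed to be unique.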
Upvotes: 2
Reputation: 37832
You can use intersect to get the values common to both DataFrames, and then check if it's empty:
DF1.intersect(DF2).take(1).isEmpty
That will use only one action (take(1)) and a fairly quick one.
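Put together with the example data from the question, the whole thing could look roughly like this. It is a minimal sketch rather than a definitive implementation: the object name DisjointKeysCheck, the helper areDisjoint, the column name "key", the local SparkSession, and the output path are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DisjointKeysCheck {
  // Hypothetical helper: true when no key appears in both DataFrames.
  // take(1) is the only action, and it can stop as soon as one common row is found.
  def areDisjoint(a: DataFrame, b: DataFrame): Boolean =
    a.intersect(b).take(1).isEmpty

  def main(args: Array[String]): Unit = {
    // Local session purely for demonstration.
    val spark = SparkSession.builder().master("local[*]").appName("disjoint-keys").getOrCreate()
    import spark.implicits._

    val df1 = Seq(1, 2, 3, 4, 5).toDF("key")
    val df2 = Seq(3, 6, 7, 8, 9, 10).toDF("key")

    // Key 3 is in both DataFrames, so with this data nothing is written.
    if (areDisjoint(df1, df2)) {
      df1.write.mode("overwrite").parquet("/tmp/df1_keys.parquet") // assumed output path
    }

    spark.stop()
  }
}
```

Note that intersect expects both DataFrames to have the same number of columns with compatible types, which holds here because each has a single integer column.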
Upvotes: 9