Reputation: 13
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.except(df2)
df3.show()
Surprisingly, except does not behave the way I expected. Here is my output: df2 is created correctly and contains 1, 2 and 3, but df3 still contains 1, 2 and/or 3. It seems random: if I run the code several times, I get different results. Can anyone please help me? Thanks in advance.
Upvotes: 1
Views: 10234
Reputation: 2988
You can use a leftanti join for this, provided the column you are comparing on contains unique values.
Example:
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.join(df2,Seq("num"),"leftanti")
df3.show()
Upvotes: 0
Reputation: 8996
The best way to test this is to create a new DataFrame that contains exactly the values you want to diff.
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = List(1,2,3).toDF("num")
df2.show()
val df3 = df.except(df2)
df3.show()
Alternatively, just write a deterministic filter to select the rows you want:
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = df.filter("num <= 3")
df2.show()
val df3 = df.except(df2)
df3.show()
Upvotes: 0
Reputation: 1584
You need to run a Spark "action" that collects the data for df2 before performing the except operation. This ensures that df2 is computed beforehand and has fixed content, which is then subtracted from df.
The randomness comes from Spark's lazy evaluation: Spark puts all of your code in one stage, so the contents of df2 are not fixed when you perform the except operation on it. As per the Spark function definition for limit:
"Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset."
Since limit returns a Dataset, it is evaluated lazily.
The code below gives a consistent output.
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
// head(3) is an action; getInt(0) keeps "num" an integer column, as in df
var df2 = df.head(3).map(_.getInt(0)).toList.toDF("num")
df2.show()
var df3 = df.except(df2)
df3.show()
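To see why re-evaluation matters, here is a minimal Spark-free sketch in plain Scala. The class UnstableSource and its limit method are hypothetical stand-ins, not Spark APIs: each call to limit deliberately returns a different slice to mimic an unordered limit being recomputed. It only models the behaviour described above and makes no claim about how Spark picks rows internally.

```scala
// Hypothetical stand-in for a lazily evaluated, unordered source (not a Spark API).
final class UnstableSource(rows: List[Int]) {
  private var evaluations = 0

  // Every call re-runs the "query" and may pick different rows,
  // simulated here by shifting the window by one on each evaluation.
  def limit(n: Int): List[Int] = {
    val picked = rows.slice(evaluations, evaluations + n)
    evaluations += 1
    picked
  }
}

object LazyLimitSketch {
  def main(args: Array[String]): Unit = {
    val rows = (1 to 11).toList
    val source = new UnstableSource(rows)

    // Like df2.show(): first evaluation of the limit.
    val shown = source.limit(3)
    // Like df.except(df2): the limit is evaluated again, so other rows are removed.
    val lazyResult = rows.diff(source.limit(3))
    println(s"shown = $shown, leaked = ${lazyResult.intersect(shown)}")

    // Like head(3): evaluate once, keep the result, then subtract that fixed value.
    val fixed = source.limit(3)
    val stableResult = rows.diff(fixed)
    println(s"fixed = $fixed, leaked = ${stableResult.intersect(fixed)}")
  }
}
```

In the first case a row that was shown survives the subtraction, because the limit was evaluated a second time. In the second case nothing leaks, because the subtracted rows were fixed before the difference was computed.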
Upvotes: 3