Nasrul Islam
Nasrul Islam

Reputation: 13

Spark DataFrame's `except()` removes different items every time

var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 =  df.except(df2)
df3.show()

Surprisingly, I found that except is not working the way it should. Here is my output: df2: created correctly, contains 1,2 and 3. But my df3 still has 1, 2 and/or 3 in it. It's kind of random. If I run it multiple times, I get different result. Can anyone please help me? Thanks in advance.

Upvotes: 1

Views: 10234

Answers (3)

Nikunj Kakadiya
Nikunj Kakadiya

Reputation: 2988

One could use a leftanti join for this if you have uniqueness in the column for which you are comparing.

Example:

var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")  
df.show()  
var df2 = df.limit(3)  
df2.show()  
var df3 =  df.join(df2,Seq("num"),"leftanti")  
df3.show()  

Upvotes: 0

marios
marios

Reputation: 8996

Best way to test this is to just create a new DF that has the values you want to diff.

val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = List(1,2,3).toDF("num")
df2.show()
val df3 =  df.except(df2)
df3.show()

Alternatively, just write a deterministic filter to select the rows you want:

val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = df.filter("num <= 3")
df2.show()
val df3 =  df.except(df2)
df3.show()

Upvotes: 0

Amit Kumar
Amit Kumar

Reputation: 1584

You need to put a spark "action" to collect the data that is required for "df2" before performing the "except" operation, which will ensure that the dataframe df2 get computed before hand and has the fixed content which will be subtracted from df.

Randomness is because spark lazy evaluation and spark is putting all your code in one stage. And the contents of "df2" is not fixed when you performed the "except" operation on it. As per the spark function definition for limit:

Returns a new Dataset by taking the first n rows. The difference between this function and head is that head is an action and returns an array (by triggering query execution) while limit returns a new Dataset.

since, it return a datset, will be lazy evaluation,

Below code will give you a consistent output.

var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.head(3).map(f => f.mkString).toList.toDF("num")
df2.show()
var df3 =  df.except(df2)
df3.show()

Upvotes: 3

Related Questions