Chetan
Chetan

Reputation: 21

How can I avoid Cartesian in DataFrame join in Scala?

Please refer following dataframes

I want to get mismatching rows in col2 after matching col1 of both dataframes I am trying following but it's not producing result because seems the Dataframe join is forming Cartesian

val dfs = Seq((1,1),(1,2),(1,3),(2,6)).toDF("col1","col2")
val dft = Seq((1,1),(1,2),(1,4)).toDF("col1","col2") 

dfs.join(dft,"col1").filter(dfs("col2").notEqual(dft("col2"))).show

In above case I expect the join & filter to return result (1,3) But seems it's joining every row of col1 in dfs to every row in col1 on dft thus producing unwanted result

Is the Cartesion as following normal behaviour for Dataframe join or I am missing some setting? how can I get (1,3) as output?

scala> dft.join(dfs,dft("col1")===dfs("col1")).show
+----+----+----+----+
|col1|col2|col1|col2|
+----+----+----+----+
|   1|   1|   1|   3|
|   1|   1|   1|   2|
|   1|   1|   1|   1|
|   1|   2|   1|   3|
|   1|   2|   1|   2|
|   1|   2|   1|   1|
|   1|   4|   1|   3|
|   1|   4|   1|   2|
|   1|   4|   1|   1|
+----+----+----+----+

Thanks chetab

Upvotes: 2

Views: 415

Answers (1)

user6022341
user6022341

Reputation:

This is not Cartesian product. You join by col1 so output contains all combinations of row with matching col1. Result is correct.

Upvotes: 1

Related Questions