KurryF
KurryF

Reputation: 81

How to return only the left side when using joinWith in Scala

so I have the following, and I am trying to use joinWith and I only want whats on the left side and then I want to run a function after that join, I'm doing the following but it doesn't work.

(FYI for clarity The left side has more additional columns in real life which is why i am running the function after the join, just using this as a basic example)

val df1 = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value_left")
val df2 = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value_right")

def anotherFunction(parm: Dataset[Row]): Dataset[Row]

 val transformedDf: Dataset[Row] = df1
      .joinWith(df2, df1("id")===df2("id"),"inner")
      .transform(anotherFunction)

the output of the inner join before i try to get the left side looks like this, i only want the left side in this case but when i pass my function through .transform() it says whats required is a Dataset[(Row, Row)]

INNER JOIN
+---+----------+-----------+
| id|value_left|value_right|
+---+----------+-----------+
|  3|        A3|         A3|
|  4|        A4|       A4_1|
|  4|        A4|         A4|
+---+----------+-----------+

Upvotes: 0

Views: 664

Answers (1)

Rayan Ral
Rayan Ral

Reputation: 1859

If I understood you correctly, you don't want the actual join, but df1, filtered by ids in df2. Take a look at left_semi join type. If you change your code to df1.joinWith(df2, df1("id")===df2("id"),"left_semi"), you should get something like

LEFT SEMI JOIN
+---+----------+
| id|value_left|
+---+----------+
|  3|        A3| 
|  4|        A4|

Some more info can be found here - https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html#joinWith

Also, you can just joined_df.drop("value_right")

Upvotes: 2

Related Questions