Reputation: 81
so I have the following, and I am trying to use joinWith and I only want whats on the left side and then I want to run a function after that join, I'm doing the following but it doesn't work.
(FYI for clarity The left side has more additional columns in real life which is why i am running the function after the join, just using this as a basic example)
val df1 = Seq((1, "A1"), (2, "A2"), (3, "A3"), (4, "A4")).toDF("id", "value_left")
val df2 = Seq((3, "A3"), (4, "A4"), (4, "A4_1"), (5, "A5"), (6, "A6")).toDF("id", "value_right")
def anotherFunction(parm: Dataset[Row]): Dataset[Row]
val transformedDf: Dataset[Row] = df1
.joinWith(df2, df1("id")===df2("id"),"inner")
.transform(anotherFunction)
the output of the inner join before i try to get the left side looks like this, i only want the left side in this case but when i pass my function through .transform()
it says whats required is a Dataset[(Row, Row)]
INNER JOIN
+---+----------+-----------+
| id|value_left|value_right|
+---+----------+-----------+
| 3| A3| A3|
| 4| A4| A4_1|
| 4| A4| A4|
+---+----------+-----------+
Upvotes: 0
Views: 664
Reputation: 1859
If I understood you correctly, you don't want the actual join, but df1
, filtered by ids in df2
.
Take a look at left_semi
join type. If you change your code to
df1.joinWith(df2, df1("id")===df2("id"),"left_semi")
, you should get something like
LEFT SEMI JOIN
+---+----------+
| id|value_left|
+---+----------+
| 3| A3|
| 4| A4|
Some more info can be found here - https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins.html#joinWith
Also, you can just joined_df.drop("value_right")
Upvotes: 2