Reputation: 187
I have two dataframes like this:
DF1:
id | name
---
1 | abc
2 | xyz
DF2:
id | course
---
1 | c1
1 | c2
1 | c3
2 | c1
2 | c3
When I do a left_outer or inner join of df1 and df2, I want the resultant dataframe to come as:
id | name | course
---
1 | abc | c1
2 | xyz | c1
It doesn't matter whether it is c1, c2, or c3 for id 1 when I join; I just need one record per id.
Please let me know how I can achieve this in Spark.
Thanks, John
Upvotes: 4
Views: 4487
Reputation: 215137
How about dropping the duplicated records in df2 based on the id column, which keeps only one record per unique id, and then joining the result with df1:
df1.join(df2.dropDuplicates(Seq("id")), Seq("id"), "inner").show
+---+----+------+
| id|name|course|
+---+----+------+
| 1| abc| c1|
| 2| xyz| c1|
+---+----+------+
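To make the semantics concrete, here is a minimal plain-Python sketch of the same keep-one-row-per-id-then-join logic (plain Python only for illustration; the variable names df1/df2 mirror the dataframes in the question). Note that Spark's dropDuplicates does not guarantee which of the duplicate rows is kept; this sketch keeps the first row seen:

```python
# Rows from the question, as lists of dicts.
df1 = [{"id": 1, "name": "abc"}, {"id": 2, "name": "xyz"}]
df2 = [
    {"id": 1, "course": "c1"},
    {"id": 1, "course": "c2"},
    {"id": 1, "course": "c3"},
    {"id": 2, "course": "c1"},
    {"id": 2, "course": "c3"},
]

# Analogue of df2.dropDuplicates(Seq("id")): keep one row per id
# (here, the first one encountered).
deduped = {}
for row in df2:
    deduped.setdefault(row["id"], row)

# Analogue of the inner join on "id".
result = [
    {"id": r["id"], "name": r["name"], "course": deduped[r["id"]]["course"]}
    for r in df1
    if r["id"] in deduped
]
print(result)
# [{'id': 1, 'name': 'abc', 'course': 'c1'}, {'id': 2, 'name': 'xyz', 'course': 'c1'}]
```

If you need control over which course is kept (e.g. the smallest one), an aggregation such as df2.groupBy("id").agg(first("course")) or a min over course, joined back to df1, would give a deterministic choice.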
Upvotes: 5