Reputation: 261
I thought i knew sql joins, but now i'm not so sure about that. I have a dataframe with movie ratings and another dataframe with userIds and their indexes. I want to join both dataframes so that i will have the corresponding user index for every movie rating. However after joining the tables i get more records than i had before the join which makes no sense to me. I expect to get the same amount of records but with an extra column of u_number: My first idea was to use Left join with ratingsDf as the left and userDataFrame as the right but i get undesired results with any of the joins i tried.
The command i use for the join :
val ratingsUsers = ratingsDf.join(userDataFrame, ratingsDf("uid") === userDataFrame("uid"),"left" )
These are the tables :
scala> ratingsDf.show(5)
+--------------+----------+------+
| uid| mid|rating|
+--------------+----------+------+
|A1V0C9SDO4DKLA|B0002IQNAG| 4.0|
|A38WAOQVVWOVEY|B0002IQNAG| 4.0|
|A2JP0URFHXP6DO|B0002IQNAG| 5.0|
|A2X4HJ26YWTGJU|B0002IQNAG| 5.0|
|A3A98961GZKIGD|B0002IQNAG| 5.0|
+--------------+----------+------+
scala> userDataFrame.show(5)
+--------------+--------+
| uid|u_number|
+--------------+--------+
|A10049L7AJW9M7| 0|
|A1007G0226CSWC| 1|
|A100FQCUCZO2WG| 2|
|A100JCBNALJFAW| 3|
|A100K3KEMSVSCM| 4|
+--------------+--------+
Upvotes: 0
Views: 567
Reputation: 261
So the issue was indeed a problem with duplicate keys in the UserDataFrame. The issue was i used .distinct() on the users rdd which had (k,v) tuples and i thought distinct() worked on keys only, but it takes the whole tuple into consideration which left me with duplicate keys in the dataframe created from that rdd.
Thanks for the help.
Upvotes: 1