Spark (Scala) sqlDataFrame Joins not working as expected

Question

I thought i knew sql joins, but now i'm not so sure about that. I have a dataframe with movie ratings and another dataframe with userIds and their indexes. I want to join both dataframes so that i will have the corresponding user index for every movie rating. However after joining the tables i get more records than i had before the join which makes no sense to me. I expect to get the same amount of records but with an extra column of u_number: My first idea was to use Left join with ratingsDf as the left and userDataFrame as the right but i get undesired results with any of the joins i tried.

The command i use for the join :

val ratingsUsers = ratingsDf.join(userDataFrame, ratingsDf("uid") === userDataFrame("uid"),"left" )

These are the tables :

scala> ratingsDf.show(5)
+--------------+----------+------+
|           uid|       mid|rating|
+--------------+----------+------+
|A1V0C9SDO4DKLA|B0002IQNAG|   4.0|
|A38WAOQVVWOVEY|B0002IQNAG|   4.0|
|A2JP0URFHXP6DO|B0002IQNAG|   5.0|
|A2X4HJ26YWTGJU|B0002IQNAG|   5.0|
|A3A98961GZKIGD|B0002IQNAG|   5.0|
+--------------+----------+------+
scala> userDataFrame.show(5)
+--------------+--------+
|           uid|u_number|
+--------------+--------+
|A10049L7AJW9M7|       0|
|A1007G0226CSWC|       1|
|A100FQCUCZO2WG|       2|
|A100JCBNALJFAW|       3|
|A100K3KEMSVSCM|       4|
+--------------+--------+

Spark (Scala) sqlDataFrame Joins not working as expected

Answers (1)

Related Questions