addmeaning
addmeaning

Reputation: 1398

How to join multiple columns from one DataFrame with another DataFrame

I have two DataFrames recommendations and movies. Columns rec1-rec3 in recommendations represent movie id from movies dataframe.

val recommendations: DataFrame = List(
        (0, 1, 2, 3),
        (1, 2, 3, 4),
        (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

val movies = List(
        (1, "the Lord of the Rings"),
        (2, "Star Wars"),
        (3, "Star Trek"),
        (4, "Pulp Fiction")).toDF("id", "name")

What I want:

+---+------------------------+------------+------------+
| id|                    rec1|        rec2|        rec3|
+---+------------------------+------------+------------+
|  0|   the Lord of the Rings|   Star Wars|   Star Trek|
|  1|               Star Wars|   Star Trek|Pulp Fiction|
|  2|   the Lord of the Rings|   Star Trek|   Star Trek|
+---+------------------------+------------+------------+

Upvotes: 0

Views: 1044

Answers (2)

mtoto
mtoto

Reputation: 24198

We can also use the functions stack() and pivot() to arrive at your expected output, joining the two dataframes only once.

// First rename 'id' column to 'ids' avoid duplicate names further downstream
val moviesRenamed = movies.withColumnRenamed("id", "ids")

recommendations.select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
  .where("rec is not null")
  .join(moviesRenamed, col("movie_id") === moviesRenamed.col("ids"))
  .groupBy("id")
  .pivot("rec")
  .agg(first("name"))
  .show()
+---+--------------------+---------+------------+
| id|                rec1|     rec2|        rec3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+

Upvotes: 5

addmeaning
addmeaning

Reputation: 1398

I figured it out. You should create aliases for your columns just like in SQL.

  val joined = recommendation
    .join(movies.select(col("id").as("id1"), 'name.as("n1")), 'id1 === recommendation.col("rec1"))
    .join(movies.select(col("id").as("id2"), 'name.as("n2")), 'id2 === recommendation.col("rec2"))
    .join(movies.select(col("id").as("id3"), 'name.as("n3")), 'id3  === recommendation.col("rec3"))
    .select('id, 'n1, 'n2, 'n3)
  joined.show()

Query will result in

+---+--------------------+---------+------------+
| id|                  n1|       n2|          n3|
+---+--------------------+---------+------------+
|  0|the Lord of the R...|Star Wars|   Star Trek|
|  1|           Star Wars|Star Trek|Pulp Fiction|
|  2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+

Upvotes: 1

Related Questions