Siddharth Satpathy

Reputation: 3043

Join PySpark dataframes with unequal numbers of rows

I have two PySpark dataframes, shown below.

The first is df1:

+-----+-----+----------+-----+
| name| type|timestamp1|score|
+-----+-----+----------+-----+
|name1|type1|2012-01-10|   11|
|name2|type1|2012-01-10|   14|
|name3|type2|2012-01-10|    2|
|name3|type2|2012-01-17|    3|
|name1|type1|2012-01-18|   55|
|name1|type1|2012-01-19|   10|
+-----+-----+----------+-----+

The second is df2:

+-----+-------------------+-------+-------+
| name|         timestamp2|string1|string2|
+-----+-------------------+-------+-------+
|name1|2012-01-10 00:00:00|      A|     aa|
|name2|2012-01-10 00:00:00|      A|     bb|
|name3|2012-01-10 00:00:00|      C|     cc|
|name4|2012-01-17 00:00:00|      D|     dd|
|name3|2012-01-10 00:00:00|      C|     cc|
|name2|2012-01-17 00:00:00|      A|     bb|
|name2|2012-01-17 00:00:00|      A|     bb|
|name4|2012-01-10 00:00:00|      D|     dd|
|name3|2012-01-17 00:00:00|      C|     cc|
+-----+-------------------+-------+-------+

The two dataframes share one column, name. Each unique value of name in df2 is associated with a single pair of string1 and string2 values.

I want to join df1 and df2 into a new dataframe df3 that contains all the rows of df1 (same structure and number of rows as df1), with the string1 and string2 values from df2 attached to the matching name in each row. This is how I want the combined dataframe (df3) to look:

+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name1|type1|2012-01-10|   11|      A|     aa|
|name2|type1|2012-01-10|   14|      A|     bb|
|name3|type2|2012-01-10|    2|      C|     cc|
|name3|type2|2012-01-17|    3|      C|     cc|
|name1|type1|2012-01-18|   55|      A|     aa|
|name1|type1|2012-01-19|   10|      A|     aa|
+-----+-----+----------+-----+-------+-------+

How can I get the above-mentioned dataframe (df3)?

I tried df3 = df1.join(df2.select("name", "string1", "string2"), on=["name"], how="left"), but that gives me a dataframe with more rows than df1, many of them duplicates.

You can use the following code to generate df1 and df2.

from pyspark.sql import Row, SparkSession

# In the pyspark shell `spark` already exists; create one otherwise.
spark = SparkSession.builder.getOrCreate()

df1_Stats = Row("name", "type", "timestamp1", "score")

df1_stat1 = df1_Stats('name1', 'type1', "2012-01-10", 11)
df1_stat2 = df1_Stats('name2', 'type1', "2012-01-10", 14)
df1_stat3 = df1_Stats('name3', 'type2', "2012-01-10", 2)
df1_stat4 = df1_Stats('name3', 'type2', "2012-01-17", 3)
df1_stat5 = df1_Stats('name1', 'type1', "2012-01-18", 55)
df1_stat6 = df1_Stats('name1', 'type1', "2012-01-19", 10)

df1_stat_lst = [df1_stat1 , df1_stat2, df1_stat3, df1_stat4, df1_stat5, df1_stat6]

df1 = spark.createDataFrame(df1_stat_lst)

df2_Stats = Row("name", "timestamp2", "string1", "string2")

df2_stat1 = df2_Stats("name1", "2012-01-10 00:00:00", "A", "aa")
df2_stat2 = df2_Stats("name2", "2012-01-10 00:00:00", "A", "bb")
df2_stat3 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat4 = df2_Stats("name4", "2012-01-17 00:00:00", "D", "dd")
df2_stat5 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat6 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat7 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat8 = df2_Stats("name4", "2012-01-10 00:00:00", "D", "dd")
df2_stat9 = df2_Stats("name3", "2012-01-17 00:00:00", "C", "cc")

df2_stat_lst = [
    df2_stat1,
    df2_stat2,
    df2_stat3,
    df2_stat4,
    df2_stat5,
    df2_stat6,
    df2_stat7,
    df2_stat8,
    df2_stat9,
]

df2 = spark.createDataFrame(df2_stat_lst)

Upvotes: 0

Views: 647

Answers (2)

Kapil

Reputation: 166

It would be better to remove duplicates before joining, so the table being joined is smaller:

df3 = df1.join(df2.select("name", "string1", "string2").distinct(), on=["name"], how="left")

Upvotes: 1

Siddharth Satpathy

Reputation: 3043

Apparently the following technique does it:

df3 = df1.join(
    df2.select("name", "string1", "string2"), on=["name"], how="left"
).dropDuplicates()
df3.show()

+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name2|type1|2012-01-10|   14|      A|     bb|
|name3|type2|2012-01-10|    2|      C|     cc|
|name1|type1|2012-01-18|   55|      A|     aa|
|name1|type1|2012-01-10|   11|      A|     aa|
|name3|type2|2012-01-17|    3|      C|     cc|
|name1|type1|2012-01-19|   10|      A|     aa|
+-----+-----+----------+-----+-------+-------+

I am still open to answers, so if you have a more efficient method, please feel free to post it.

Upvotes: 2
