Siddharth Satpathy

Reputation: 3043

Join PySpark dataframes with unequal numbers of rows

I have two PySpark dataframes, shown below.

The first is df1:

+-----+-----+----------+-----+
| name| type|timestamp1|score|
+-----+-----+----------+-----+
|name1|type1|2012-01-10|   11|
|name2|type1|2012-01-10|   14|
|name3|type2|2012-01-10|    2|
|name3|type2|2012-01-17|    3|
|name1|type1|2012-01-18|   55|
|name1|type1|2012-01-19|   10|
+-----+-----+----------+-----+

The second is df2:

+-----+-------------------+-------+-------+
| name|         timestamp2|string1|string2|
+-----+-------------------+-------+-------+
|name1|2012-01-10 00:00:00|      A|     aa|
|name2|2012-01-10 00:00:00|      A|     bb|
|name3|2012-01-10 00:00:00|      C|     cc|
|name4|2012-01-17 00:00:00|      D|     dd|
|name3|2012-01-10 00:00:00|      C|     cc|
|name2|2012-01-17 00:00:00|      A|     bb|
|name2|2012-01-17 00:00:00|      A|     bb|
|name4|2012-01-10 00:00:00|      D|     dd|
|name3|2012-01-17 00:00:00|      C|     cc|
+-----+-------------------+-------+-------+

The two dataframes share one column, name. Each unique value of name in df2 is associated with a single pair of string1 and string2 values.

I want to join df1 and df2 into a new dataframe df3 that contains all the rows of df1 (same structure and number of rows as df1), with the string1 and string2 values from df2 attached to the matching name in each row. This is how I want the combined dataframe (df3) to look:

+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name1|type1|2012-01-10|   11|      A|     aa|
|name2|type1|2012-01-10|   14|      A|     bb|
|name3|type2|2012-01-10|    2|      C|     cc|
|name3|type2|2012-01-17|    3|      C|     cc|
|name1|type1|2012-01-18|   55|      A|     aa|
|name1|type1|2012-01-19|   10|      A|     aa|
+-----+-----+----------+-----+-------+-------+

How can I get the above-mentioned dataframe (df3)?

I tried df3 = df1.join(df2.select("name", "string1", "string2"), on=["name"], how="left"), but that gives me a dataframe with more rows than df1, many of them duplicates.

You can use the following code to generate df1 and df2.

from pyspark.sql import Row, SparkSession

# In the pyspark shell `spark` already exists; create one otherwise.
spark = SparkSession.builder.getOrCreate()

df1_Stats = Row("name", "type", "timestamp1", "score")

df1_stat1 = df1_Stats('name1', 'type1', "2012-01-10", 11)
df1_stat2 = df1_Stats('name2', 'type1', "2012-01-10", 14)
df1_stat3 = df1_Stats('name3', 'type2', "2012-01-10", 2)
df1_stat4 = df1_Stats('name3', 'type2', "2012-01-17", 3)
df1_stat5 = df1_Stats('name1', 'type1', "2012-01-18", 55)
df1_stat6 = df1_Stats('name1', 'type1', "2012-01-19", 10)

df1_stat_lst = [df1_stat1 , df1_stat2, df1_stat3, df1_stat4, df1_stat5, df1_stat6]

df1 = spark.createDataFrame(df1_stat_lst)

df2_Stats = Row("name", "timestamp2", "string1", "string2")

df2_stat1 = df2_Stats("name1", "2012-01-10 00:00:00", "A", "aa")
df2_stat2 = df2_Stats("name2", "2012-01-10 00:00:00", "A", "bb")
df2_stat3 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat4 = df2_Stats("name4", "2012-01-17 00:00:00", "D", "dd")
df2_stat5 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat6 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat7 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat8 = df2_Stats("name4", "2012-01-10 00:00:00", "D", "dd")
df2_stat9 = df2_Stats("name3", "2012-01-17 00:00:00", "C", "cc")

df2_stat_lst = [
    df2_stat1,
    df2_stat2,
    df2_stat3,
    df2_stat4,
    df2_stat5,
    df2_stat6,
    df2_stat7,
    df2_stat8,
    df2_stat9,
]

df2 = spark.createDataFrame(df2_stat_lst)

Upvotes: 0

Views: 647

Answers (2)

Kapil

Reputation: 166

It would be better to remove duplicates before joining, so the table being joined is smaller:

df3 = df1.join(df2.select("name", "string1", "string2").distinct(), on=["name"], how="left")

Upvotes: 1

Siddharth Satpathy

Reputation: 3043

Apparently the following technique does it:

df3 = df1.join(
    df2.select("name", "string1", "string2"), on=["name"], how="left"
).dropDuplicates()
df3.show()

+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name2|type1|2012-01-10|   14|      A|     bb|
|name3|type2|2012-01-10|    2|      C|     cc|
|name1|type1|2012-01-18|   55|      A|     aa|
|name1|type1|2012-01-10|   11|      A|     aa|
|name3|type2|2012-01-17|    3|      C|     cc|
|name1|type1|2012-01-19|   10|      A|     aa|
+-----+-----+----------+-----+-------+-------+

I am still open to answers, so if you have a more efficient method, please feel free to post it.

Upvotes: 2
