Reputation: 3043
I have two PySpark dataframe which are as given underneath
First is df1
which is given below:
+-----+-----+----------+-----+
| name| type|timestamp1|score|
+-----+-----+----------+-----+
|name1|type1|2012-01-10| 11|
|name2|type1|2012-01-10| 14|
|name3|type2|2012-01-10| 2|
|name3|type2|2012-01-17| 3|
|name1|type1|2012-01-18| 55|
|name1|type1|2012-01-19| 10|
+-----+-----+----------+-----+
Second is df2
which is given below:
+-----+-------------------+-------+-------+
| name| timestamp2|string1|string2|
+-----+-------------------+-------+-------+
|name1|2012-01-10 00:00:00| A| aa|
|name2|2012-01-10 00:00:00| A| bb|
|name3|2012-01-10 00:00:00| C| cc|
|name4|2012-01-17 00:00:00| D| dd|
|name3|2012-01-10 00:00:00| C| cc|
|name2|2012-01-17 00:00:00| A| bb|
|name2|2012-01-17 00:00:00| A| bb|
|name4|2012-01-10 00:00:00| D| dd|
|name3|2012-01-17 00:00:00| C| cc|
+-----+-------------------+-------+-------+
These two dataframes have one common column, i.e. name
. Each unique value of name
in df2
has unique values of string1
and string2
.
I want to join df1
and df2
and form a new dataframe df3
such that df3
contains all the rows of df1
(same structure, numbers of rows as df1
) but assigns values from columns string1
and string2
(from df2
) to appropriate values of name
in df1
. Following is how I want the combined dataframe (df3
) to look like.
+-----+-----+----------+-----+-------+-------+
| name| type|timestamp1|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name1|type1|2012-01-10| 11| A| aa|
|name2|type1|2012-01-10| 14| A| bb|
|name3|type2|2012-01-10| 2| C| cc|
|name3|type2|2012-01-17| 3| C| cc|
|name1|type1|2012-01-18| 55| A| aa|
|name1|type1|2012-01-19| 10| A| aa|
+-----+-----+----------+-----+-------+-------+
How can I do get the above mentioned dataframe (df3
)?
I tried the following df3 = df1.join( df2.select("name", "string1", "string2") , on=["name"], how="left")
. But that gives me a dataframe with 14 rows with multiple (duplicate) entries of rows.
You can use the below mentioned code to generate df1
and df2
.
from pyspark.sql import *
import pyspark.sql.functions as F
df1_Stats = Row("name", "type", "timestamp1", "score")
df1_stat1 = df1_Stats('name1', 'type1', "2012-01-10", 11)
df1_stat2 = df1_Stats('name2', 'type1', "2012-01-10", 14)
df1_stat3 = df1_Stats('name3', 'type2', "2012-01-10", 2)
df1_stat4 = df1_Stats('name3', 'type2', "2012-01-17", 3)
df1_stat5 = df1_Stats('name1', 'type1', "2012-01-18", 55)
df1_stat6 = df1_Stats('name1', 'type1', "2012-01-19", 10)
df1_stat_lst = [df1_stat1 , df1_stat2, df1_stat3, df1_stat4, df1_stat5, df1_stat6]
df1 = spark.createDataFrame(df1_stat_lst)
df2_Stats = Row("name", "timestamp2", "string1", "string2")
df2_stat1 = df2_Stats("name1", "2012-01-10 00:00:00", "A", "aa")
df2_stat2 = df2_Stats("name2", "2012-01-10 00:00:00", "A", "bb")
df2_stat3 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat4 = df2_Stats("name4", "2012-01-17 00:00:00", "D", "dd")
df2_stat5 = df2_Stats("name3", "2012-01-10 00:00:00", "C", "cc")
df2_stat6 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat7 = df2_Stats("name2", "2012-01-17 00:00:00", "A", "bb")
df2_stat8 = df2_Stats("name4", "2012-01-10 00:00:00", "D", "dd")
df2_stat9 = df2_Stats("name3", "2012-01-17 00:00:00", "C", "cc")
df2_stat_lst = [
df2_stat1,
df2_stat2,
df2_stat3,
df2_stat4,
df2_stat5,
df2_stat6,
df2_stat7,
df2_stat8,
df2_stat9,
]
df2 = spark.createDataFrame(df2_stat_lst)
Upvotes: 0
Views: 647
Reputation: 166
It would be better to remove duplicates before joining , making small table to join.
df3 = df1.join(df2.select("name", "string1", "string2").distinct(),on=["name"] , how="left")
Upvotes: 1
Reputation: 3043
Apparently the following technique does it:
df3 = df1.join(
df2.select("name", "string1", "string2"), on=["name"], how="left"
).dropDuplicates()
df3.show()
+-----+-----+----------+-----+-------+-------+
| name| type| timestamp|score|string1|string2|
+-----+-----+----------+-----+-------+-------+
|name2|type1|2012-01-10| 14| A| bb|
|name3|type2|2012-01-10| 2| C| cc|
|name1|type1|2012-01-18| 55| A| aa|
|name1|type1|2012-01-10| 11| A| aa|
|name3|type2|2012-01-17| 3| C| cc|
|name1|type1|2012-01-19| 10| A| aa|
+-----+-----+----------+-----+-------+-------+
I am still open for answers. So, if you have a more efficient method of answering the question, please feel free to drop your answer.
Upvotes: 2