Cse Rockers

Reputation: 9

How to join two dataframes by condition in PySpark?

I have two dataframes, described below.

Dataframe 1

P_ID   P_Name   P_Description   P_Size
100    Moto     Mobile          16
200    Apple    Mobile          15
300    Oppo     Mobile          18

Dataframe 2

P_ID   List_Code   P_Amount
100    ALPHA       20000
100    BETA        60000
300    GAMMA       15000

Requirement: I need to join the two dataframes on P_ID.

Information about the dataframes: in dataframe 1, P_ID is a primary key; dataframe 2 doesn't have any primary attribute.

I need to create new columns in dataframe 1, one for each value of List_Code in dataframe 2, with "_price" appended to the name. If List_Code contains 20 unique values, dataframe 1 needs 20 new columns. Each new column should be filled with the P_Amount value from dataframe 2, matched on P_ID, or with zero if no matching row exists. Once those columns are added with the expected values, the join on P_ID is straightforward; my problem is creating the new columns with the expected values.

The expected dataframe is shown below

Expected dataframe

P_ID   P_Name   P_Description   P_Size   ALPHA_price   BETA_price   GAMMA_price
100    Moto     Mobile          16       20000         60000        0
200    Apple    Mobile          15      0             0            0
300    Oppo     Mobile          18      0             0            15000

Can you please help me solve this problem? Thanks in advance.

Upvotes: 1

Views: 592

Answers (1)

Sreeram TP

Reputation: 11937

For your application, you need to pivot the second dataframe and then left-join the first dataframe to the pivoted result on P_ID.

See the code below.

import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

df_1 = pd.DataFrame({'P_ID': [100, 200, 300], 'P_Name': ['Moto', 'Apple', 'Oppo'], 'P_Size': [16, 15, 18]})
sdf_1 = spark.createDataFrame(df_1)

df_2 = pd.DataFrame({'P_ID': [100, 100, 300], 'List_Code': ['ALPHA', 'BETA', 'GAMMA'], 'P_Amount': [20000, 60000, 10000]})
sdf_2 = spark.createDataFrame(df_2)

# pivot List_Code values into columns, aggregating P_Amount per P_ID
sdf_pivoted = sdf_2.groupby('P_ID').pivot('List_Code').agg(f.sum('P_Amount')).fillna(0)

# left join keeps products with no rows in sdf_2; fill their amounts with 0
sdf_joined = sdf_1.join(sdf_pivoted, on='P_ID', how='left').fillna(0)
sdf_joined.show()

+----+------+------+-----+-----+-----+
|P_ID|P_Name|P_Size|ALPHA| BETA|GAMMA|
+----+------+------+-----+-----+-----+
| 300|  Oppo|    18|    0|    0|10000|
| 200| Apple|    15|    0|    0|    0|
| 100|  Moto|    16|20000|60000|    0|
+----+------+------+-----+-----+-----+

You can change the column names or ordering of the dataframe as needed.
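For example, to get the `ALPHA_price`-style names the question asks for, one option (a sketch, not the only way) is to build the renamed column list in plain Python and apply it with `toDF`. The `pivot_cols` list below is written out by hand for illustration; in practice you would take it from `sdf_pivoted.columns`:

```python
# Columns as they come out of the pivot (hand-written here for illustration;
# in practice use sdf_pivoted.columns)
pivot_cols = ['P_ID', 'ALPHA', 'BETA', 'GAMMA']

# Append "_price" to every column except the join key
renamed = [c if c == 'P_ID' else c + '_price' for c in pivot_cols]
print(renamed)  # ['P_ID', 'ALPHA_price', 'BETA_price', 'GAMMA_price']

# In PySpark, apply the new names before the join with:
#   sdf_pivoted = sdf_pivoted.toDF(*renamed)
```

This keeps the renaming data-driven, so it still works if List_Code has 20 unique values instead of 3.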

Upvotes: 1
