flck
flck

Reputation: 29

How to join between different elements of two Pyspark dataframes

I have two dataframes named df1 and df2, the content of data dataframe is as follows.

df1:

line_item_usage_account_id  line_item_unblended_cost    name 
100000000001                12.05                       account1
200000000001                52                          account2
300000000003                12.03                       account3

df2:

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52

I need the IDs of df1.line_item_usage_account_id that are not in df2.accountproviderid to be added in the join, something like this:

accountname     accountproviderid   clustername     app_pmo     app_costcenter      line_item_unblended_cost
account1        100000000001        cluster1        111111      11111111            12.05
account2        200000000001        cluster2        222222      22222222            52
account3        300000000003        NA              NA          NA                  12.03

the id "300000000003" from df1.line_item_usage_account_id is not found in df2.accountproviderid, so it was added in the new dataframe.

Any idea how to achieve this? I appreciate any help.

Upvotes: 0

Views: 67

Answers (1)

Cena
Cena

Reputation: 3419

You can use a right join here:

df2.join(df1, (df2.accountproviderid == df1.line_item_usage_account_id), "right")\
    .drop("accountname", "accountproviderid")\
    .withColumnRenamed("line_item_usage_account_id", "accountproviderid")\
    .withColumnRenamed("name", "accountname")\
    .select("accountname", "accountproviderid", "clustername", "app_pmo",\
     "app_costcenter", "line_item_unblended_cost").show()

+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
|   account1|     100000000001|   cluster1| 111111|      11111111|                   12.05|
|   account2|     200000000001|   cluster2| 222222|      22222222|                    52.0|
|   account3|     300000000003|       null|   null|          null|                   12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+

Upvotes: 1

Related Questions