Reputation: 29
I have two dataframes named df1 and df2, the content of data dataframe is as follows.
df1:
line_item_usage_account_id line_item_unblended_cost name
100000000001 12.05 account1
200000000001 52 account2
300000000003 12.03 account3
df2:
accountname accountproviderid clustername app_pmo app_costcenter line_item_unblended_cost
account1 100000000001 cluster1 111111 11111111 12.05
account2 200000000001 cluster2 222222 22222222 52
I need the IDs of df1.line_item_usage_account_id that are not in df2.accountproviderid to be added in the join, something like this:
accountname accountproviderid clustername app_pmo app_costcenter line_item_unblended_cost
account1 100000000001 cluster1 111111 11111111 12.05
account2 200000000001 cluster2 222222 22222222 52
account3 300000000003 NA NA NA 12.03
the id "300000000003" from df1.line_item_usage_account_id is not found in df2.accountproviderid, so it was added in the new dataframe.
Any idea how to achieve this? I appreciate any help.
Upvotes: 0
Views: 67
Reputation: 3419
You can use a right join
here:
df2.join(df1, (df2.accountproviderid == df1.line_item_usage_account_id), "right")\
.drop("accountname", "accountproviderid")\
.withColumnRenamed("line_item_usage_account_id", "accountproviderid")\
.withColumnRenamed("name", "accountname")\
.select("accountname", "accountproviderid", "clustername", "app_pmo",\
"app_costcenter", "line_item_unblended_cost").show()
+-----------+-----------------+-----------+-------+--------------+------------------------+
|accountname|accountproviderid|clustername|app_pmo|app_costcenter|line_item_unblended_cost|
+-----------+-----------------+-----------+-------+--------------+------------------------+
| account1| 100000000001| cluster1| 111111| 11111111| 12.05|
| account2| 200000000001| cluster2| 222222| 22222222| 52.0|
| account3| 300000000003| null| null| null| 12.03|
+-----------+-----------------+-----------+-------+--------------+------------------------+
Upvotes: 1