AJwr

Reputation: 618

Python Spark join two dataframes and fill column

I have two dataframes that need to be joined in a particular way that I am struggling with.

dataframe 1:

+--------------------+---------+----------------+
|        asset_domain|      eid|             oid|
+--------------------+---------+----------------+
|      test-domain...|   126656|          126656|
|    nebraska.aaa.com|   335660|          335660|
|         netflix.com|      460|             460|
+--------------------+---------+----------------+

dataframe 2:

+--------------------+--------------------+---------+--------------+----+----+------------+
|               asset|        asset_domain|dns_count|            ip|  ev|post|form_present|
+--------------------+--------------------+---------+--------------+----+----+------------+
| sub1.test-domain...|      test-domain...|     6354| 11.11.111.111|   1|   1|        null|
|         netflix.com|         netflix.com|     3836| 22.22.222.222|null|null|        null|
+--------------------+--------------------+---------+--------------+----+----+------------+

desired result:

+--------------------+---------+-------------+----+----+------------+---------+----------------+
|               asset|dns_count|           ip|  ev|post|form_present|      eid|             oid|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
|         netflix.com|     3836|22.22.222.222|null|null|        null|      460|             460|
| sub1.test-domain...|     6354|11.11.111.111|   1|   1|        null|   126656|          126656|
|    nebraska.aaa.com|     null|         null|null|null|        null|   335660|          335660|
+--------------------+---------+-------------+----+----+------------+---------+----------------+

Basically – it should join df1 and df2 on asset_domain, and if an asset_domain from df1 has no match in df2, the resulting asset should fall back to the asset_domain from df1.

I tried df = df2.join(df1, ["asset_domain"], "right").drop("asset_domain") but that obviously leaves null in the asset column for nebraska.aaa.com since it does not have a matching domain in df2. How do I go about adding those to the asset column for this particular case?

Upvotes: 2

Views: 626

Answers (2)

Raghu

Reputation: 1712

After the join, you can use the isNull() function to fill the missing asset values from asset_domain:

import pyspark.sql.functions as F

# Sample frames: tst2's asset column is null for 'flipkart'
tst1 = sqlContext.createDataFrame([('netflix', 1), ('amazon', 2)],
                                  schema=("asset_domain", 'xtra1'))
tst2 = sqlContext.createDataFrame([('netflix', 'yahoo', 1), ('amazon', 'yahoo', 2),
                                   ('flipkart', None, 2)],
                                  schema=("asset_domain", "asset", 'xtra'))

# Right join keeps every row of tst2
tst_j = tst1.join(tst2, on='asset_domain', how='right')

# Where asset is null, fall back to asset_domain
tst_res = tst_j.withColumn("asset", F.when(F.col('asset').isNull(),
                                           F.col('asset_domain')).otherwise(F.col('asset')))

Upvotes: 0

notNull

Reputation: 31540

You can use the coalesce function (from pyspark.sql.functions) after the join to create the asset column:

from pyspark.sql.functions import coalesce

# coalesce picks asset when present, otherwise falls back to asset_domain
df2.join(df1, ["asset_domain"], "right") \
   .select(coalesce("asset", "asset_domain").alias("asset"),
           "dns_count", "ip", "ev", "post", "form_present", "eid", "oid") \
   .orderBy("asset") \
   .show()
#+----------------+---------+-------------+----+----+------------+------+------+
#|           asset|dns_count|           ip|  ev|post|form_present|   eid|   oid|
#+----------------+---------+-------------+----+----+------------+------+------+
#|nebraska.aaa.com|     null|         null|null|null|        null|335660|335660|
#|     netflix.com|     3836|22.22.222.222|null|null|        null|   460|   460|
#|sub1.test-domain|     6354|11.11.111.111|   1|   1|        null|126656|126656|
#+----------------+---------+-------------+----+----+------------+------+------+

Upvotes: 4
