AJwr

Reputation: 618

Python Spark join two dataframes and fill column

I have two dataframes that need to be joined in a particular way that I am struggling with.

dataframe 1:

+--------------------+---------+----------------+
|        asset_domain|      eid|             oid|
+--------------------+---------+----------------+
|      test-domain...|   126656|          126656|
|    nebraska.aaa.com|   335660|          335660|
|         netflix.com|      460|             460|
+--------------------+---------+----------------+

dataframe 2:

+--------------------+--------------------+---------+--------------+----+----+------------+
|               asset|        asset_domain|dns_count|            ip|  ev|post|form_present|
+--------------------+--------------------+---------+--------------+----+----+------------+
| sub1.test-domain...|      test-domain...|     6354| 11.11.111.111|   1|   1|        null|
|         netflix.com|         netflix.com|     3836| 22.22.222.222|null|null|        null|
+--------------------+--------------------+---------+--------------+----+----+------------+

desired result:

+--------------------+---------+-------------+----+----+------------+---------+----------------+
|               asset|dns_count|           ip|  ev|post|form_present|      eid|             oid|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
|         netflix.com|     3836|22.22.222.222|null|null|        null|      460|             460|
| sub1.test-domain...|     6354|11.11.111.111|   1|   1|        null|   126656|          126656|
|    nebraska.aaa.com|     null|         null|null|null|        null|   335660|          335660|
+--------------------+---------+-------------+----+----+------------+---------+----------------+

Basically – it should join df1 and df2 on asset_domain, and if an asset_domain from df1 has no match in df2, the resulting asset should fall back to the asset_domain from df1.

I tried df = df2.join(df1, ["asset_domain"], "right").drop("asset_domain") but that obviously leaves null in the asset column for nebraska.aaa.com since it does not have a matching domain in df2. How do I go about adding those to the asset column for this particular case?

Upvotes: 2

Views: 626

Answers (2)

Raghu

Reputation: 1712

After the join, you can use the isNull() function to fill the missing asset values from asset_domain:

import pyspark.sql.functions as F

# Sample frames: tst2's asset column is null for 'flipkart'
tst1 = sqlContext.createDataFrame([('netflix', 1), ('amazon', 2)],
                                  schema=("asset_domain", 'xtra1'))
tst2 = sqlContext.createDataFrame([('netflix', 'yahoo', 1), ('amazon', 'yahoo', 2),
                                   ('flipkart', None, 2)],
                                  schema=("asset_domain", "asset", 'xtra'))

# Right join keeps every row of tst2
tst_j = tst1.join(tst2, on='asset_domain', how='right')

# Where asset is null, fall back to asset_domain
tst_res = tst_j.withColumn("asset", F.when(F.col('asset').isNull(),
                                           F.col('asset_domain')).otherwise(F.col('asset')))

Upvotes: 0

notNull

Reputation: 31540

You can use the coalesce function (from pyspark.sql.functions) after the join to create the asset column:

from pyspark.sql.functions import coalesce

# coalesce picks asset when present, otherwise falls back to asset_domain
df2.join(df1, ["asset_domain"], "right") \
   .select(coalesce("asset", "asset_domain").alias("asset"),
           "dns_count", "ip", "ev", "post", "form_present", "eid", "oid") \
   .orderBy("asset") \
   .show()
#+----------------+---------+-------------+----+----+------------+------+------+
#|           asset|dns_count|           ip|  ev|post|form_present|   eid|   oid|
#+----------------+---------+-------------+----+----+------------+------+------+
#|nebraska.aaa.com|     null|         null|null|null|        null|335660|335660|
#|     netflix.com|     3836|22.22.222.222|null|null|        null|   460|   460|
#|sub1.test-domain|     6354|11.11.111.111|   1|   1|        null|126656|126656|
#+----------------+---------+-------------+----+----+------------+------+------+

Upvotes: 4
