Reputation: 618
I have two dataframes that need to be joined in a particular way I am struggling with.
dataframe 1:
+--------------------+---------+----------------+
| asset_domain| eid| oid|
+--------------------+---------+----------------+
| test-domain...| 126656| 126656|
| nebraska.aaa.com| 335660| 335660|
| netflix.com| 460| 460|
+--------------------+---------+----------------+
dataframe 2:
+--------------------+--------------------+---------+--------------+----+----+------------+
| asset| asset_domain|dns_count| ip| ev|post|form_present|
+--------------------+--------------------+---------+--------------+----+----+------------+
| sub1.test-domain...| test-domain...| 6354| 11.11.111.111| 1| 1| null|
| netflix.com| netflix.com| 3836| 22.22.222.222|null|null| null|
+--------------------+--------------------+---------+--------------+----+----+------------+
desired result:
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| asset|dns_count| ip| ev|post|form_present| eid| oid|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
| netflix.com| 3836|22.22.222.222|null|null| null| 460| 460|
| sub1.test-domain...| 5924|111.11.111.11| 1| 1| null| 126656| 126656|
| nebraska.aaa.com| null| null|null|null| null| 335660| 335660|
+--------------------+---------+-------------+----+----+------------+---------+----------------+
Basically – it should join df1 and df2 on asset_domain
but if that doesn't exist in df2, then the resulting asset
should be the asset_domain
from df1.
I tried df = df2.join(df1, ["asset_domain"], "right").drop("asset_domain")
but that obviously leaves null
in the asset
column for nebraska.aaa.com
since it does not have a matching domain in df2. How do I go about adding those to the asset
column for this particular case?
Upvotes: 2
Views: 626
Reputation: 1712
After the join you can use the isNull() function
import pyspark.sql.functions as F
tst1 = sqlContext.createDataFrame([('netflix',1),('amazon',2)],schema=("asset_domain",'xtra1'))
tst2= sqlContext.createDataFrame([('netflix','yahoo',1),('amazon','yahoo',2),('flipkart',None,2)],schema=("asset_domain","asset",'xtra'))
tst_j = tst1.join(tst2,on='asset_domain',how='right')
#%%
tst_res = tst_j.withColumn("asset",F.when(F.col('asset').isNull(),F.col('asset_domain')).otherwise(F.col('asset')))
Upvotes: 0
Reputation: 31540
you can use coalesce
function after join to create asset column.
df2.join(df1, ["asset_domain"], "right").select(coalesce("asset","asset_domain").alias("asset"),"dns_count","ip","ev","post","form_present","eid","oid").orderBy("asset").show()
#+----------------+---------+-------------+----+----+------------+------+------+
#| asset|dns_count| ip| ev|post|form_present| eid| oid|
#+----------------+---------+-------------+----+----+------------+------+------+
#|nebraska.aaa.com| null| null|null|null| null|335660|335660|
#| netflix.com| 3836|22.22.222.222|null|null| None| 460| 460|
#|sub1.test-domain| 6354|11.11.111.111| 1| 1| null|126656|126656|
#+----------------+---------+-------------+----+----+------------+------+------+
Upvotes: 4