Reputation: 297
What is the most effective way to write this in pyspark:
Pseudocode:
table1 inner join table2
on cookies if table1.cookie is not Null
else join on ids
table1:
id, cookie
1, 1q2w
2, Null
table2:
id, cookie
1, 1q2w
2, 3e4r
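For reference, a minimal sketch that builds these sample tables as DataFrames (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows from the tables above; cookie is nullable in table1.
table1 = spark.createDataFrame([(1, '1q2w'), (2, None)], ['id', 'cookie'])
table2 = spark.createDataFrame([(1, '1q2w'), (2, '3e4r')], ['id', 'cookie'])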
Upvotes: 0
Views: 742
Reputation: 2200
On the PySpark side, you can create two DataFrames based on whether table1.cookie is null or not, then merge them:
>>> import pyspark.sql.functions as F
>>> df1 = table1.where(F.col('cookie').isNull()).join(table2, table1.id == table2.id, 'inner').select(table1.id, table2.cookie)
>>> df2 = table1.where(F.col('cookie').isNotNull()).join(table2, table1.cookie == table2.cookie, 'inner').select(table1.id, table2.cookie)
>>> df1.union(df2).show()
+---+------+
| id|cookie|
+---+------+
| 2| 3e4r|
| 1| 1q2w|
+---+------+
Upvotes: 1
Reputation: 1271121
The most effective way is often to use left joins:
select . . .,
       coalesce(t2c.colx, t2i.colx) as colx
from table1 t1 left join
     table2 t2c
     on t1.cookie = t2c.cookie left join
     table2 t2i
     on t1.id = t2i.id and t2c.cookie is null
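A PySpark rendering of this double left join could look like the sketch below, using the id and cookie columns from the question (colx above stands for whatever extra columns you carry through):

import pyspark.sql.functions as F

# Sketch: left join on cookie first, then on id only where the cookie
# join found no match, and coalesce the two candidate matches.
t1 = table1.alias('t1')
t2c = table2.alias('t2c')
t2i = table2.alias('t2i')

result = (t1
    .join(t2c, F.col('t1.cookie') == F.col('t2c.cookie'), 'left')
    .join(t2i, (F.col('t1.id') == F.col('t2i.id')) & F.col('t2c.cookie').isNull(), 'left')
    .select(F.col('t1.id'),
            F.coalesce(F.col('t2c.cookie'), F.col('t2i.cookie')).alias('cookie')))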
Upvotes: 0
Reputation: 15150
You can use OR, as in the answers so far submitted. In my experience, though, joins with OR perform very badly. You can also use a UNION / UNION ALL:
select *
from table1
inner join table2
on table1.cookie = table2.cookie
UNION (ALL) -- UNION removes duplicates, UNION ALL keeps them.
select *
from table1
inner join table2
on table1.id = table2.id
where table1.cookie is null -- only fall back to the id join when cookie is null
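Since the question is about PySpark, the same UNION ALL strategy can be run through Spark SQL; a sketch, assuming the table1 and table2 DataFrames from the question and a SparkSession named spark:

# Register the DataFrames as temp views, then run the union query.
table1.createOrReplaceTempView('table1')
table2.createOrReplaceTempView('table2')

result = spark.sql("""
    select table1.id, table2.cookie
    from table1 inner join table2
      on table1.cookie = table2.cookie
    union all
    select table1.id, table2.cookie
    from table1 inner join table2
      on table1.id = table2.id
    where table1.cookie is null
""")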
Upvotes: 1
Reputation: 37493
You can try using OR:
select * from
table1 inner join table2
on table1.cookie = table2.cookie or table1.id = table2.id
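In PySpark the OR condition translates directly; a sketch assuming the table1 and table2 DataFrames from the question:

# Sketch: a single join whose condition matches on cookie or on id.
result = table1.join(
    table2,
    (table1.cookie == table2.cookie) | (table1.id == table2.id),
    'inner')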
Upvotes: 0
Reputation: 95101
Join on cookies or on IDs when table1.cookie is null:
select *
from table1 t1
join table2 t2 on t1.cookie = t2.cookie
or (t1.cookie is null and t1.id = t2.id)
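The same condition can be written in PySpark; a sketch assuming the table1 and table2 DataFrames from the question:

# Sketch: match on cookie, or fall back to id only when cookie is null.
cond = ((table1.cookie == table2.cookie)
        | (table1.cookie.isNull() & (table1.id == table2.id)))
result = table1.join(table2, cond, 'inner')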
Upvotes: 0