bry888
bry888

Reputation: 297

Conditional join on different columns

What is the most effective way to write this in pyspark:

Pseudocode:

table1 inner join table2 
on cookies if table1.cookie is not Null 
else join on ids

table1:

id, cookie
1, 1q2w
2, Null

table2:

id, cookie
1, 1q2w
2, 3e4r

Upvotes: 0

Views: 742

Answers (5)

Ali Yesilli
Ali Yesilli

Reputation: 2200

On pyspark side, you can create two df based on whether table1.cookie is null or not then merge them

>>> import pyspark.sql.functions as F
>>> df1 = table1.where(F.isnull('cookie')==True).join(table2, table1.id == table2.id, 'inner').select(table1.id,table2.cookie)
>>> df2 = table1.where(F.isnull('cookie')==False).join(table2, table1.cookie == table2.cookie, 'inner').select(table1.id,table2.cookie)
>>> df1.union(df2).show()
+---+------+                                                                    
| id|cookie|
+---+------+
|  2|  3e4r|
|  1|  1q2w|
+---+------+

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1271121

The most effective way is often to use left joins:

select . . .,
       coalesce(t2c.colx, t2i.colx) as colx
from table1 t1 inner join
     table2 t2c
     on t1.cookie = t2.cookie left join
     table2 t2i
     on t1.id = t2i.id and t2c.cookie is null

Upvotes: 0

HoneyBadger
HoneyBadger

Reputation: 15150

You can use OR as in the answers so far submitted. In my experience, though, joins with or perform very badly. You can also use a UNION|UNION ALL:

select * 
from table1 
inner join table2 
on table1.cookies= table2.cookies

UNION (ALL) -- UNION removes duplicates, UNION ALL keeps them.

select * 
from table1 
inner join table2 
on table1.id=table2.id

Upvotes: 1

Fahmi
Fahmi

Reputation: 37493

You can try using OR

DEMO

select * from 
table1 inner join table2 
on table1.cookies= table2.cookies or table1.id=table2.id

Upvotes: 0

Thorsten Kettner
Thorsten Kettner

Reputation: 95101

Join on cookies or on IDs when table1.cookie is null:

select *
from table1 t1
join table2 t2 on t1.cookie = t2.cookie
               or (t1.cookie is null and t1.id = t2.id)

Upvotes: 0

Related Questions