PySpark AND EXISTS equivalent to sql query

Question

I have following sql query and I need same result in pySpark:

Select * 
From FIRSTKEY o 
Where o.INCRNR=123456
And exists ( 
select 1 
from SECONDKEY b 
where o.ID = b.ID 
and o.IPDAT = b.IPDAT            
and b.INCRNR=123456 
and b.TYPE = '02'  
)

Especially the and exists condition is one that I cannot resolve. I saw here, that it can be done by join or LEFT SEMI JOIN but as its still not clear to me how to use it I would be grateful to see it on example above. Thanks for any advice!

EDIT: I need it with build in functions as because of other transformations of data. Something like:

def incrementalTransformFunction(output, fsInput, fsInput2):
    input_df = fsInput.dataframe('added')
    input2_df = fsInput2.dataframe('added')
    # output
    output.write_dataframe(input_df.drop(*drop_list).filter(input_df.incrnr==123456)
.andexists((b.id == o.id) & (o.ipdat == b.ipdat) & (b.incrnr == 123456) & (b.type == '02')))

Shubham Jain · Accepted Answer

As suggested by @Jim

SQL

Select * 
From FIRSTKEY o 
Where o.INCRNR=123456 And (o.ID,o.IPDAT) in  ( 
select ID, IPDAT from SECONDKEY b 
where b.INCRNR=123456 and b.TYPE = '02');

pyspark

FIRSTKEY.filter(col('INCRNR')=='123456').\
         join(SECONDKEY.filter((col('INCRNR')=='123456') & (col('TYPE')=='02')), ['ID', 'IPDAT'], 'inner')

PySpark AND EXISTS equivalent to sql query

Answers (2)

Related Questions