Reputation: 71
I have following sql query and I need same result in pySpark:
Select *
From FIRSTKEY o
Where o.INCRNR=123456
And exists (
select 1
from SECONDKEY b
where o.ID = b.ID
and o.IPDAT = b.IPDAT
and b.INCRNR=123456
and b.TYPE = '02'
)
Especially the and exists
condition is one that I cannot resolve. I saw here, that it can be done by join
or LEFT SEMI JOIN
but as its still not clear to me how to use it I would be grateful to see it on example above. Thanks for any advice!
EDIT: I need it with build in functions as because of other transformations of data. Something like:
def incrementalTransformFunction(output, fsInput, fsInput2):
input_df = fsInput.dataframe('added')
input2_df = fsInput2.dataframe('added')
# output
output.write_dataframe(input_df.drop(*drop_list).filter(input_df.incrnr==123456)
.andexists((b.id == o.id) & (o.ipdat == b.ipdat) & (b.incrnr == 123456) & (b.type == '02')))
Upvotes: 1
Views: 345
Reputation: 5536
As suggested by @Jim
SQL
Select *
From FIRSTKEY o
Where o.INCRNR=123456 And (o.ID,o.IPDAT) in (
select ID, IPDAT from SECONDKEY b
where b.INCRNR=123456 and b.TYPE = '02');
pyspark
FIRSTKEY.filter(col('INCRNR')=='123456').\
join(SECONDKEY.filter((col('INCRNR')=='123456') & (col('TYPE')=='02')), ['ID', 'IPDAT'], 'inner')
Upvotes: 1
Reputation: 5155
You can use IN
condition in PySpqrk instead,
Select *
From FIRSTKEY o
Where o.INCRNR=123456 And (o.ID,o.IPDAT) in (
select ID, IPDAT from SECONDKEY b
where b.INCRNR=123456 and b.TYPE = '02');
Upvotes: 2