Reputation: 41
I have a pyspark dataframe df:
A B C
E00 FT AS
E01 FG AD
E02 FF AB
E03 FH AW
E04 FF AQ
E05 FV AR
E06 FD AE
and another smaller pyspark dataframe but with 3 rows with the same values, df2:
A B C
Null QW Null
QA Null Null
E03 FH AW
E04 FF AQ
E05 FV AR
Null Null Null
Is there a way in pyspark to create a third boolean dataframe from the rows in df2 are in df? Such as:
A B C
False False False
False False False
False False False
True True True
True True True
True True True
False False False
Many thanks in advance.
Upvotes: 0
Views: 28
Reputation: 42392
You can do a left join and assign False
if all columns joined from df2
are null:
import pyspark.sql.functions as F
result = df1.alias('df1').join(
df2.alias('df2'),
F.least(*[F.expr(f'df1.{c} = df2.{c}') for c in df1.columns]),
'left'
).select(
[
(~F.greatest(*[F.col(f'df2.{c2}').isNull() for c2 in df2.columns])).alias(c)
for c in df1.columns
]
)
result.show()
+-----+-----+-----+
| A| B| C|
+-----+-----+-----+
|false|false|false|
|false|false|false|
|false|false|false|
| true| true| true|
| true| true| true|
| true| true| true|
|false|false|false|
+-----+-----+-----+
Upvotes: 1