Abikian

Reputation: 49

What would be a PySpark equivalent of the SQL statement NOT IN?

If I have table A and table B, and I want to select certain IDs from table A that are not in table B, I can use the following SQL command:

SELECT ID
FROM TableA
WHERE ID NOT IN (SELECT ID FROM TableB)

What would be the equivalent code in PySpark?

Upvotes: 1

Views: 390

Answers (1)

user2314737

Reputation: 29307

You can do a left anti-join by passing the join type "left_anti" (Spark also accepts the aliases "leftanti" and "anti"):

A_df.show()
# +-----+---+
# | type| id|
# +-----+---+
# |type1| 10|
# |type2| 20|
# +-----+---+


B_df.show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  1|name1|  10|
# |  2|name2|  30|
# |  3|name3|  20|
# +---+-----+----+


# keep only the rows of B_df whose "type" has no match in A_df's "id"
B_df.join(A_df, B_df.type == A_df.id, "left_anti").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+

This is equivalent to SELECT * FROM B_df WHERE type NOT IN (SELECT id FROM A_df). One caveat: if the subquery side contains NULLs, SQL's NOT IN returns no rows at all, whereas a left anti-join simply treats NULLs as non-matching, so the two can differ on data with NULLs.
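
For completeness, here is a filter-based sketch of the same NOT IN logic using Column.isin; it assumes A_df is small enough to collect its ids to the driver:

# collect the ids from A_df (assumes A_df is small)
ids = [row.id for row in A_df.select("id").collect()]
# keep rows of B_df whose "type" is not among the collected ids
B_df.filter(~B_df.type.isin(ids)).show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+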

Alternatively, in Spark SQL:

# register the dataframes as temporary views so they can be queried with spark.sql
A_df.createOrReplaceTempView("A_table")
B_df.createOrReplaceTempView("B_table")

spark.sql("SELECT * FROM B_table LEFT ANTI JOIN A_table ON B_table.type == A_table.id").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+
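
Spark SQL also supports NOT IN subqueries directly, so once the views are registered the original query shape works as-is:

spark.sql("SELECT * FROM B_table WHERE type NOT IN (SELECT id FROM A_table)").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# |  2|name2|  30|
# +---+-----+----+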

Here's how I created the dataframes:

A = [("type1",10), \
    ("type2",20), \
  ]
AColumns = ["type","id"]
A_df = spark.createDataFrame(data=A, schema = AColumns)
A_df.printSchema()
A_df.show(truncate=False)

B = [(1,"name1",10), \
    (2,"name2",30), \
    (3,"name3",20) \
  ]
BColumns = ["id","name","type"]
B_df = spark.createDataFrame(data=B, schema = BColumns)
B_df.printSchema()
B_df.show(truncate=False)
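
As a final note, if you only need the non-matching values themselves rather than whole rows, a set-difference sketch with subtract (Spark's EXCEPT DISTINCT, which matches columns by position) also works:

B_df.select("type").subtract(A_df.select("id")).show()
# +----+
# |type|
# +----+
# |  30|
# +----+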

Upvotes: 1
