Reputation: 49
If I have table A and table B, and I want to select the IDs from table A that are not in table B, I can use the following SQL:
SELECT ID
FROM TableA
WHERE ID NOT IN (SELECT ID FROM TableB)
What would be the equivalent code in PySpark?
Upvotes: 1
Views: 390
Reputation: 29307
You could do a left anti-join by passing "left_anti" (or its synonym "anti") as the join type:
A_df.show()
# +-----+---+
# | type| id|
# +-----+---+
# |type1| 10|
# |type2| 20|
# +-----+---+
B_df.show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 1|name1| 10|
# | 2|name2| 30|
# | 3|name3| 20|
# +---+-----+----+
B_df.join(A_df, B_df.type == A_df.id, "anti").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 2|name2| 30|
# +---+-----+----+
This would be equivalent to SELECT * FROM B_df WHERE type NOT IN (SELECT id FROM A_df)
In an SQL context (see spark sql anti-join), register the dataframes as temporary views and query them. Note that SQLContext.registerDataFrameAsTable is deprecated since Spark 2.0; createOrReplaceTempView is the current way to expose a DataFrame to SQL:
# register the dataframes as temporary views
A_df.createOrReplaceTempView("A_table")
B_df.createOrReplaceTempView("B_table")
spark.sql("SELECT * FROM B_table LEFT ANTI JOIN A_table ON B_table.type = A_table.id").show()
# +---+-----+----+
# | id| name|type|
# +---+-----+----+
# | 2|name2| 30|
# +---+-----+----+
Here's how I created the dataframes:
A = [("type1", 10),
     ("type2", 20)]
AColumns = ["type", "id"]
A_df = spark.createDataFrame(data=A, schema=AColumns)
A_df.printSchema()
A_df.show(truncate=False)

B = [(1, "name1", 10),
     (2, "name2", 30),
     (3, "name3", 20)]
BColumns = ["id", "name", "type"]
B_df = spark.createDataFrame(data=B, schema=BColumns)
B_df.printSchema()
B_df.show(truncate=False)
Upvotes: 1