Reputation: 143
I have a list of IDs that I need to filter a pyspark.sql.DataFrame by. The list contains 3,000,000 values. The approach I am using is:
from pyspark.sql import functions as fn

df_tmp.filter(fn.col("device_id").isin(device_id))
This takes very long and eventually gets stuck. Is there a faster alternative?
Upvotes: 3
Views: 2206
Reputation: 15258
Try this:
from pyspark.sql import functions as F
# Build a one-column DataFrame from the ID list, broadcast it to every
# executor, and inner-join: only rows whose device_id is in the list survive.
df_tmp.join(
    F.broadcast(
        spark.createDataFrame(
            [(ID_,) for ID_ in device_id],
            ["device_id"],
        )
    ),
    on="device_id",
)
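If the goal is purely to filter df_tmp (and the ID list might contain duplicates), a left semi join is a safer variant: it keeps only the rows of df_tmp whose device_id appears in the list, without adding columns or duplicating rows. A minimal sketch, assuming device_id is a plain Python list and spark is your SparkSession:

from pyspark.sql import functions as F

# One-column DataFrame built from the ID list.
ids_df = spark.createDataFrame([(i,) for i in device_id], ["device_id"])

# left_semi keeps df_tmp's matching rows and nothing else: no columns from
# ids_df, and repeated IDs in the list cannot duplicate output rows.
filtered = df_tmp.join(F.broadcast(ids_df), on="device_id", how="left_semi")

The explicit broadcast hint matters here: a 3,000,000-row ID table will typically exceed Spark's default autoBroadcastJoinThreshold (10 MB), so without the hint the join would likely fall back to a full shuffle.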
Upvotes: 7