Doof

Reputation: 143

pyspark "isin" taking too long

I have a list of IDs that needs to be filtered against a pyspark.sql.DataFrame. The list has 3,000,000 values. The approach I am using is:

from pyspark.sql import functions as fn

df_tmp.filter(fn.col("device_id").isin(device_id))

This takes very long and appears to get stuck. What is an alternative to this?

Upvotes: 3

Views: 2206

Answers (1)

Steven

Reputation: 15258

Try this:

from pyspark.sql import functions as F

# Build a one-column DataFrame from the ID list, broadcast it,
# and join on device_id to keep only the matching rows.
df_tmp.join(
    F.broadcast(
        spark.createDataFrame(
            [(ID_,) for ID_ in device_id],
            ["device_id"],
        )
    ),
    on="device_id",
)
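
If you want the result to behave exactly like the original filter (keep only df_tmp's columns and avoid duplicated rows in case device_id contains repeats), a left-semi join on the same broadcast DataFrame does that. A minimal sketch, reusing the df_tmp and device_id list from the question:

from pyspark.sql import functions as F

# One-column lookup DataFrame built from the Python list of IDs.
ids_df = spark.createDataFrame([(ID_,) for ID_ in device_id], ["device_id"])

# "leftsemi" keeps only the rows of df_tmp whose device_id appears in ids_df,
# returning df_tmp's columns unchanged.
df_filtered = df_tmp.join(F.broadcast(ids_df), on="device_id", how="leftsemi")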

Upvotes: 7
