Reputation: 123

Pyspark Dataframe count taking too long

We have a PySpark DataFrame with around 25k records. We are trying to run a count or empty check on it, and it is taking too long. We tried:

df.count()
df.rdd.isEmpty()
len(df.head(1))==0
Converted to pandas and checked pandas_df.empty
Tried the Arrow option
df.cache() and df.persist() before the counts
df.repartition(n)
Tried writing the df to DBFS, but the write also takes a long time (cancelled after 20 minutes)

Could you please help us understand what we are doing wrong?

Note: There are no duplicate values in df, and we built df from multiple joins.

Upvotes: 2

Answers (1)

Matt Andruff

Reputation: 5155

Without looking at the df.explain() output it's hard to know the specific issue, but it certainly seems like you could have a skewed data set. (Skew usually shows up in the Spark UI as one task taking a lot longer than the other tasks in the same stage to finish.) If you are on a recent version of Spark (3.0 or later), there are tools to help with this out of the box:

spark.sql.adaptive.enabled = true
spark.sql.adaptive.skewJoin.enabled = true
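
If it helps, those two settings can also be applied to a running session from PySpark. This is a minimal sketch, assuming an existing SparkSession is passed in as `spark` and that you are on Spark 3.0 or later; the helper name `enable_adaptive_skew_handling` is made up for illustration.

```python
# The two settings named above; adaptive query execution needs Spark 3.0+.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_adaptive_skew_handling(spark):
    """Apply the AQE settings to an existing session and return the values now in effect.

    `spark` is assumed to be a SparkSession (anything exposing conf.set / conf.get).
    """
    for key, value in AQE_SETTINGS.items():
        spark.conf.set(key, value)
    # Read the values back so the caller can confirm they took.
    return {key: spark.conf.get(key) for key in AQE_SETTINGS}
```

Call `enable_adaptive_skew_handling(spark)` before building the joined df, then rerun the count and compare the stage timings in the Spark UI.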

Count is not taking too long. It's taking the time it needs to complete what you asked Spark to do: count() is an action, so it triggers execution of every join used to build df. To reduce that work, do the things you are likely already doing: filter the data before joining so that only the rows you need are shuffled into the joins, and, if you can't use adaptive query execution, review your data for skew and program around it.
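
One rough way to review a source for skew is to count rows per join key and look at the heaviest keys. This is a sketch only: `top_join_keys` is a hypothetical helper, and `df` and `key_col` stand in for one of your source DataFrames and the column you join on.

```python
def top_join_keys(df, key_col, top_n=10):
    """Return a DataFrame with the top_n most frequent values of key_col.

    `df` is assumed to be a PySpark DataFrame and `key_col` a column used in a join.
    """
    return (
        df.groupBy(key_col)
        .count()                            # adds a column named "count"
        .orderBy("count", ascending=False)  # heaviest keys first
        .limit(top_n)
    )
```

For example, `top_join_keys(orders_df, "customer_id").show()` (both names invented here) prints the ten most common keys. If a handful of values, or nulls, account for most of the rows, that join is a likely source of skew.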

Convince yourself this is a data issue. Limit each source [data/tables] to 1000 or 10000 records and see if it runs fast. Then, one at a time, remove the limit from a single [table/data source] (keeping the limit on all the others) to find the table that is the source of your problem. Then study that [table/data source] and figure out how to work around the issue (if you can't use adaptive query execution to fix it).
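
The limit-one-at-a-time experiment could be scripted roughly like this. It is a sketch under assumptions: `sources` is a dict of your source DataFrames keyed by a name you choose, and `build_df` is a hypothetical function you would write that takes such a dict and returns the joined df. Neither name comes from the question.

```python
import time


def timed_count(df):
    """Run df.count() and return (row_count, elapsed_seconds)."""
    start = time.perf_counter()
    rows = df.count()
    return rows, time.perf_counter() - start


def time_each_unlimited_source(sources, build_df, limit=1000):
    """Time the joined count with all sources limited, then with one source unlimited at a time.

    sources:  dict mapping a name to a source DataFrame (assumed)
    build_df: hypothetical callable taking such a dict and returning the joined DataFrame
    """
    limited = {name: df.limit(limit) for name, df in sources.items()}
    # Baseline: every source limited. This should run fast if the problem is the data.
    timings = {"all limited": timed_count(build_df(limited))[1]}
    for name, full_df in sources.items():
        trial = dict(limited)
        trial[name] = full_df  # lift the limit on this one source only
        timings[name] = timed_count(build_df(trial))[1]
    return timings
```

The source whose entry is far slower than the "all limited" baseline is the one to study. Note that limit() without an ordering takes an arbitrary subset of rows, so treat the timings as a rough guide.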

(Finally, if you are using Hive tables, you should make sure the table stats are up to date.)

ANALYZE TABLE mytable COMPUTE STATISTICS;

Upvotes: 3
