Sohum Sachdev

Reputation: 1397

Spark's dataframe count() function taking very long

In my code, I have a sequence of dataframes and I want to filter out the ones which are empty. I'm doing something like:

Seq(df1, df2).filter(df => df.count() > 0)

However, this is taking extremely long: around 7 minutes for two dataframes of roughly 100k rows each.

My question: why is Spark's implementation of count() so slow, and is there a workaround?

Upvotes: 1

Views: 7536

Answers (1)

Ganesh

Reputation: 604

Dataframes are evaluated lazily, and count() is an action: nothing in the dataframe's lineage runs until you call it. So how long count() takes has little to do with how big the final dataframe is. If there are many costly operations behind the dataframe, calling count() is what forces Spark to actually execute all of them.

The costliest operations are usually those that need to shuffle data across the cluster, such as groupBy, join and reduce.
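For example, here is a minimal sketch (the input path and column names are made up) where nothing runs until count() is called, at which point Spark performs the read, the filter and the shuffle for the groupBy:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("lazy-count").getOrCreate()

val df = spark.read.parquet("/data/events")   // hypothetical input
  .filter(col("status") === "active")         // lazy: just extends the plan
  .groupBy(col("userId"))                     // lazy, but will require a shuffle
  .agg(sum("amount").as("total"))

println(df.count())                           // action: the whole plan executes here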

So my guess is that you either have some complex processing in the lineage of these dataframes, or the initial data they are built from is very large.
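As for a workaround: if you only need to know whether each dataframe has any rows, you don't need a full count. A sketch, assuming Spark 2.x and the df1/df2 from your question:

val nonEmpty = Seq(df1, df2)
  .map(_.cache())                     // avoid recomputing the lineage if reused later
  .filter(df => df.head(1).nonEmpty)  // stops after finding a single row

On Spark 2.4 and later, df.isEmpty does the same single-row check. Note that caching only helps if the dataframes fit in memory and are reused afterwards; the first action still pays the full cost of the lineage once.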

Upvotes: 8
