Reputation: 1370
I have a pyspark application running on EMR for which I'd like to monitor some metrics.
For example, the number of loaded and saved rows. Currently I use a count
operation to extract these values, which obviously slows down the application. I was wondering whether there are better options for extracting this kind of metric from a dataframe?
I'm using pyspark 2.4.5
Upvotes: 1
Views: 4549
Reputation: 807
If you're counting the full dataframe, try persisting the dataframe first, so that you don't have to run the computation twice.
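A minimal sketch of that pattern (assuming df is the dataframe you already loaded; the output path is just a placeholder):

df = df.persist()                       # cache the dataframe after the expensive load/transformations
loaded_rows = df.count()                # first action materializes and caches the data
df.write.parquet("s3://my-bucket/out")  # placeholder path; this reuses the cached data instead of recomputing
df.unpersist()                          # release the cached data once you're done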
If an approximate count is acceptable, you can sample before counting to speed things up.
sample_fraction = 0.01 # take a roughly 1% sample
sample_count = df.sample(fraction=sample_fraction).count() # count the sample
extrapolated_count = sample_count / sample_fraction # estimate the total count
There's also an approx_count_distinct function, if you need an approximate count of distinct values for a particular column.
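For example, a sketch assuming you want distinct values of a column ("user_id" is just a placeholder name):

from pyspark.sql import functions as F

distinct_users = df.agg(F.approx_count_distinct("user_id")).first()[0]  # approximate distinct count
# the default relative error is about 5%; tighten it with the rsd parameter, e.g.
# df.agg(F.approx_count_distinct("user_id", rsd=0.01))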
Upvotes: 2
Reputation: 2178
If you need an exact count, store the data in Parquet or Delta Lake format. These formats keep row-count statistics in their metadata, so count results come back fast (in seconds).
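A sketch of that approach (the S3 path is just a placeholder):

out_path = "s3://my-bucket/output"                 # placeholder output location
df.write.mode("overwrite").parquet(out_path)       # Parquet keeps row counts in file footers
saved_rows = spark.read.parquet(out_path).count()  # count can be answered largely from metadata, so it's fast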
If you can live without an exact count, you can use DataFrame.isEmpty, DataFrame.first, DataFrame.head(&lt;number of rows&gt;), etc. to cover your needs.
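A quick sketch of those cheaper checks (note: DataFrame.isEmpty is only available in the Scala API and newer PySpark releases, so on PySpark 2.4.x the head-based check below stands in for it):

has_rows = len(df.head(1)) > 0   # touches at most one partition instead of scanning everything
first_row = df.first()           # returns the first Row, or None if the dataframe is empty
preview = df.head(5)             # a small list of Rows, much cheaper than a full count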
Upvotes: 3