iurii_n

Reputation: 1370

What is the best way to count rows in a Spark DataFrame for monitoring?

I have a pyspark application running on EMR for which I'd like to monitor some metrics, for example the number of loaded and saved rows. Currently I use the count operation to extract these values, which, obviously, slows down the application. I was wondering whether there are better options for extracting this kind of metric from a dataframe. I'm using pyspark 2.4.5.
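A simplified sketch of what I do now (the paths and filter are just placeholders):

df = spark.read.parquet("s3://bucket/input/")      # load
loaded_rows = df.count()                            # extra full scan just for the metric
result = df.filter("status = 'ok'")                 # some transformation
saved_rows = result.count()                         # another full scan for the metric
result.write.parquet("s3://bucket/output/")         # the save itself scans the data again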

Upvotes: 1

Views: 4549

Answers (2)

mattficke

Reputation: 807

If you're counting the full dataframe, try persisting it first, so that you don't have to run the computation twice.
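For example (a rough sketch; the dataframe name and output path are placeholders):

df = df.persist()                          # cache before the first action
row_count = df.count()                     # triggers the computation once and fills the cache
df.write.parquet("s3://bucket/output/")    # the save reuses the cached data
df.unpersist()                             # free the cache when you're done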

If an approximate count is acceptable, you can sample before counting to speed things up.

sample_fraction = 0.01 # take a roughly 1% sample
sample_count = df.sample(fraction=sample_fraction).count() # count the sample
extrapolated_count = sample_count / sample_fraction # estimate the total count

There's also an approx_count_distinct function, if you need a count of distinct values for a particular column.
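For example (assuming a hypothetical user_id column):

from pyspark.sql import functions as F

# estimates the number of distinct user_id values with ~5% relative error by default
distinct_users = df.select(F.approx_count_distinct("user_id")).first()[0]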

Upvotes: 2

Salim

Reputation: 2178

If you need an exact count, store the data in Parquet or Delta Lake format. These formats keep row-count statistics, so count results come back fast (in seconds).
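A rough sketch of the idea (the path is a placeholder; it assumes the default Parquet reader, which can answer a plain count() largely from file metadata rather than a full column scan):

df.write.mode("overwrite").parquet("s3://bucket/output/")           # save the data
exact_count = spark.read.parquet("s3://bucket/output/").count()     # fast metadata-based count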

If you can live without an exact count, you can use DataFrame.isEmpty, DataFrame.first, DataFrame.head(&lt;number of rows&gt;), etc., depending on your needs.
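For example, a cheap sanity check instead of a full count (a sketch; note that pyspark 2.4 does not expose DataFrame.isEmpty, so head(1) plays that role):

is_empty = len(df.head(1)) == 0    # fetches at most one row instead of counting everything
preview = df.head(5)               # a handful of rows for a quick sanity check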

Upvotes: 3
