Reputation: 1370
I have a pyspark application running on EMR for which I'd like to monitor some metrics.
For example, the number of loaded and saved rows. Currently I use a count
operation to extract these values, which obviously slows down the application. I was wondering whether there are better options for extracting this kind of metric from a dataframe?
I'm using pyspark 2.4.5
Upvotes: 1
Views: 4549
Reputation: 807
If you're counting the full dataframe, try persisting the dataframe first, so that you don't have to run the computation twice.
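A minimal sketch of that pattern (assuming df is the dataframe you already loaded; the output path is just a placeholder):

df = df.persist()                       # cache the dataframe after the expensive load/transformations
loaded_rows = df.count()                # first action materializes and caches the data
df.write.parquet("s3://my-bucket/out")  # placeholder path; this reuses the cached data instead of recomputing
df.unpersist()                          # release the cached data once you're done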
If an approximate count is acceptable, you can sample before counting to speed things up.
sample_fraction = 0.01 # take a roughly 1% sample
sample_count = df.sample(fraction=sample_fraction).count() # count the sample
extrapolated_count = sample_count / sample_fraction # estimate the total count
There's also an approx_count_distinct function, if you need an approximate count of distinct values for a particular column.
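For example, a sketch assuming you want distinct values of a column ("user_id" is just a placeholder name):

from pyspark.sql import functions as F

distinct_users = df.agg(F.approx_count_distinct("user_id")).first()[0]  # approximate distinct count
# the default relative error is about 5%; tighten it with the rsd parameter, e.g.
# df.agg(F.approx_count_distinct("user_id", rsd=0.01))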
Upvotes: 2
Reputation: 2178
If you need an exact count, store the data in Parquet or Delta Lake format. These formats keep row-count statistics in their metadata, so count results come back fast (in seconds).
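A sketch of that approach (the S3 path is just a placeholder):

out_path = "s3://my-bucket/output"                 # placeholder output location
df.write.mode("overwrite").parquet(out_path)       # Parquet keeps row counts in file footers
saved_rows = spark.read.parquet(out_path).count()  # count can be answered largely from metadata, so it's fast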
If you can live without an exact count, you can use DataFrame.isEmpty, DataFrame.first, DataFrame.head(&lt;number of rows&gt;), etc. to cover your needs.
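A quick sketch of those cheaper checks (note: DataFrame.isEmpty is only available in the Scala API and newer PySpark releases, so on PySpark 2.4.x the head-based check below stands in for it):

has_rows = len(df.head(1)) > 0   # touches at most one partition instead of scanning everything
first_row = df.first()           # returns the first Row, or None if the dataframe is empty
preview = df.head(5)             # a small list of Rows, much cheaper than a full count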
Upvotes: 3