Funzo

Reputation: 1290

How to count a metric across executors in Spark

I have a Spark job running with many executors.

I want to be able to use a counter on the executors to count the number of occurrences of an event. For example, count the number of times the column "column" equals 10.

df.map(row => { if (row.getAs[Int]("column") == 10) counter.inc(); row })

I ultimately want the total to be the sum of counters across all executors.
Is this possible?
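
For concreteness, what I have in mind would look roughly like the following, using Spark's built-in LongAccumulator as the counter (the accumulator name is arbitrary, "column" is assumed to be an integer column, and spark is the active SparkSession):

// Driver-side accumulator; each task adds to a local copy and Spark
// merges the partial counts back on the driver.
val counter = spark.sparkContext.longAccumulator("columnEquals10")

df.foreach { row =>
  if (row.getAs[Int]("column") == 10) counter.add(1)
}

println(counter.value) // merged total across all executors

(foreach is an action, so each task's update is applied exactly once; inside transformations such as map, updates can be double-counted when tasks are retried.)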

When we report metrics from a Spark driver, we extend org.apache.spark.metrics.source.Source and register it in the Spark env. Can these metrics ever be used on the executors?
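
For reference, the driver-side setup looks roughly like this (the class and metric names are illustrative; note that Source is private[spark], so the class has to live in a package under org.apache.spark to compile):

package org.apache.spark.metrics.custom

import com.codahale.metrics.{Counter, MetricRegistry}
import org.apache.spark.SparkEnv
import org.apache.spark.metrics.source.Source

// Custom source exposing one Dropwizard counter through Spark's metrics system
class EventCountSource extends Source {
  override val sourceName: String = "EventCount"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val occurrences: Counter = metricRegistry.counter("columnEquals10")
}

object MetricsSetup {
  // Registered on the driver via the SparkEnv metrics system
  def register(): EventCountSource = {
    val source = new EventCountSource
    SparkEnv.get.metricsSystem.registerSource(source)
    source
  }
}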

Upvotes: 1

Views: 550

Answers (1)

Remis Haroon - رامز

Reputation: 3572

I think the best way is to use Spark's "sum" aggregation.

Spark will then execute the aggregation in a distributed fashion across all the nodes and return the aggregate to the driver.

import org.apache.spark.sql.functions.{col, sum, when}

// Flag matching rows with 1, everything else with 0, then sum the flags.
// Note the use of === for Column equality (== would compare object references).
df.withColumn("count_flag", when(col("column") === 10, 1).otherwise(0))
  .agg(sum("count_flag") as "Total_Occurrences_Of_Column_Value_10")
  .show()
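
Equivalently, a plain filter followed by count returns the same distributed total as a single Long on the driver:

// Count matching rows directly; Spark distributes both the filter and the count
val total: Long = df.filter(col("column") === 10).count()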

Upvotes: 1
