Reputation: 1290
I have a spark job running with many executors.
I want to be able to use a counter on the executors to count the number of occurrences of an event. For example, count the number of times the column "column" equals 10.
df.map { row => if (row.getAs[Int]("column") == 10) counter.inc(); row }
I ultimately want the total to be the sum of counters across all executors.
Is this possible?
When we report metrics from the Spark driver, we extend org.apache.spark.metrics.source.Source and register it with the Spark env. Can these metrics ever be used on the executors?
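For reference, this is roughly the driver-side pattern (a minimal sketch; EventSource and the counter name "occurrences" are illustrative, and since the Source trait is package-private in some Spark versions, the class is declared inside the org.apache.spark package tree to compile):

package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, MetricRegistry}

// Illustrative custom metrics source; sourceName and the counter name are made up.
class EventSource extends Source {
  override val sourceName: String = "EventSource"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  val occurrences: Counter = metricRegistry.counter(MetricRegistry.name("occurrences"))
}

// Registered on the driver once SparkEnv is up (accessible here because this
// file sits inside the org.apache.spark package):
// SparkEnv.get.metricsSystem.registerSource(new EventSource)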
Upvotes: 1
Views: 550
Reputation: 3572
I think the best way is to use Spark's "sum" aggregation.
Spark will then execute the aggregation in a distributed way across all the nodes and return the aggregate to the driver.
import org.apache.spark.sql.functions.{col, sum, when}

df.withColumn("count_flag", when(col("column") === 10, 1).otherwise(0))
  .agg(sum("count_flag") as "Total_Occurrence_Of_Column_Value_10")
  .show()
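If you just need the total back on the driver as a plain value rather than a displayed DataFrame, an equivalent filter-and-count is the shortest route and is executed just as distributedly (a sketch, assuming df is the same DataFrame):

val total: Long = df.filter(col("column") === 10).count()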
Upvotes: 1