user2359997

Reputation: 601

Spark: How to speed up rdd.count()

We have a streaming application which performs a count action.

tempRequestsWithState is a DStream

tempRequestsWithState.foreachRDD { rdd =>

    print (rdd.count())

}

The count action is very slow, taking about 30 minutes. I would greatly appreciate it if anyone could suggest a way to speed up this action, as we are consuming about 10,000 events/sec. I also noticed that each RDD has 54 partitions.


Upvotes: 1

Views: 2041

Answers (1)

Raphael Roth

Reputation: 27373

Although I have never used it, you could try `countApprox` on your RDD. It gives you an estimate of the true count within a time budget you specify (in milliseconds) and a confidence level (i.e. the probability that the true value lies within the returned range):

example usage:

val cntInterval = df.rdd.countApprox(timeout = 1000L, confidence = 0.95)
val (lowCnt, highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
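To plug this into the streaming loop from the question, a minimal sketch (assuming the same `tempRequestsWithState` DStream; the timeout and confidence values are illustrative, not tuned):

```scala
tempRequestsWithState.foreachRDD { rdd =>
  // countApprox returns a PartialResult[BoundedDouble] after at most
  // ~1 second, instead of blocking until the exact count is computed.
  val approx = rdd.countApprox(timeout = 1000L, confidence = 0.95)
  val bounded = approx.initialValue
  println(s"count in [${bounded.low}, ${bounded.high}] (mean ${bounded.mean})")
}
```

The trade-off is accuracy for latency: a shorter timeout yields a wider `[low, high]` interval, and partitions that did not finish within the timeout are only estimated.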

Upvotes: 1
