Reputation: 601
We have a streaming application that performs a count action.
tempRequestsWithState is a DStream:
tempRequestsWithState.foreachRDD { rdd =>
  print(rdd.count())
}
The count action is very slow: it takes about 30 minutes. I would greatly appreciate it if anyone could suggest a way to speed it up, as we are consuming at 10,000 events/sec. I also noticed that each RDD has 54 partitions.
Upvotes: 1
Views: 2041
Reputation: 27373
Although I have never used it, you could try countApprox on your RDD. It seems to give you an estimate of the true count within the time you are willing to spend (a timeout in milliseconds), at a given confidence level (i.e. the probability that the true value is within the returned range).
Example usage:
val cntInterval = df.rdd.countApprox(timeout = 1000L,confidence = 0.95)
val (lowCnt,highCnt) = (cntInterval.initialValue.low, cntInterval.initialValue.high)
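Applied to the DStream from the question, this might look roughly like the sketch below. I have not run it against a real stream. The helper name logApproxCounts and its default arguments are made up for illustration; only foreachRDD, countApprox, initialValue and the BoundedDouble fields (mean, low, high) come from Spark itself.

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical helper (not part of Spark): prints an approximate count per
// batch instead of waiting for the exact rdd.count().
def logApproxCounts[T](stream: DStream[T],
                       timeoutMs: Long = 1000L,
                       confidence: Double = 0.95): Unit =
  stream.foreachRDD { rdd =>
    // countApprox waits at most timeoutMs and returns a
    // PartialResult[BoundedDouble]; initialValue is the estimate
    // available at that point.
    val estimate = rdd.countApprox(timeoutMs, confidence).initialValue
    println(s"approx count: ${estimate.mean} (between ${estimate.low} and ${estimate.high})")
  }

// Usage, assuming tempRequestsWithState is the DStream from the question:
// logApproxCounts(tempRequestsWithState)
```

If every task finishes before the timeout, the low and high bounds should coincide with the exact count. Otherwise you get a range, so this only helps if an estimate is acceptable for your use case.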
Upvotes: 1