Reputation: 601
Could someone please explain the difference between RDD countApprox() and count(), and if possible tell us which is faster? It would be of great help. We have a requirement where count() is very slow, taking about 30 minutes. We tried countApprox(); it was fast on the first run (about 1.2 minutes) but then slowed back down to 30 minutes.
This is how we used it; not sure if it's the best way to use it:
rdd.countApprox(timeout=800, confidence=0.5)
Upvotes: 5
Views: 8018
Reputation: 4650
Not my answer, but there is a very useful and important answer here.
In short, countApprox().getFinalValue() blocks until the job completes, even if that takes longer than the timeout. getInitialValue() does not block, so you will get a response within the timeout.

BUT, as I learned from painful experience, even if you use getInitialValue() the job will keep running until it reaches the final value. If you are repeating this in a loop, the computation of the final value will still be running for multiple RDDs long after you have retrieved the result from getInitialValue(). This can then lead to OOM conditions and broadcast errors that are difficult to diagnose.
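To make the difference concrete, here is a minimal Scala sketch of that pattern (this is the PartialResult API of the Scala/Java RDD interface, where the accessor is spelled initialValue; the variable names below are just illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.partial.{BoundedDouble, PartialResult}

val spark = SparkSession.builder().appName("countApprox-demo").master("local[*]").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 10000000)

// countApprox itself waits at most `timeout` milliseconds before returning a PartialResult.
val result: PartialResult[BoundedDouble] = rdd.countApprox(timeout = 800, confidence = 0.90)

// Estimate available once the timeout expires: a mean plus a [low, high] interval.
val quick: BoundedDouble = result.initialValue
println(s"approx count ~${quick.mean} (between ${quick.low} and ${quick.high})")

// Blocks until the whole job finishes -- this can take as long as a plain count(),
// and the job keeps running in the background even if you never call it.
val exact: BoundedDouble = result.getFinalValue()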
Upvotes: 3
Reputation: 1
rdd.count() is an action, which is an eager operation.
This means that all the other transformations you had written before it will start executing now, because of Spark's lazy evaluation. So essentially it's not only the count() operation that's taking all the time, but also all the other operations that were waiting to be executed.

Now, coming back to the question of count() vs countApprox(): count() is just like doing a SELECT COUNT(*) FROM table. countApprox() takes a timeout and a confidence level and returns a result that is approximately correct, a number you can live with.

We should use countApprox() when we are more interested in an approximate number and in saving time, for example in a streaming application. count() should be used when you need the exact count, for example for logging or auditing.
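As a quick illustration of the lazy-evaluation point above, here is a small Scala sketch (the input path and variable names are illustrative); caching the RDD avoids re-running all the upstream transformations each time it is counted:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("count-demo").getOrCreate()

// Nothing below executes yet -- textFile, filter and map are lazy transformations.
val events = spark.sparkContext
  .textFile("hdfs:///data/events")          // illustrative input path
  .filter(_.nonEmpty)
  .map(_.toLowerCase)

// Cache so repeated counts do not re-run the whole lineage from the source files.
events.cache()

val exact  = events.count()                                        // action: runs the full pipeline, exact result
val approx = events.countApprox(timeout = 800, confidence = 0.95)  // action: approximate, time-bounded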
Upvotes: 0
Reputation: 936
countApprox(timeout: Long, confidence: Double)
Default: confidence = 0.95
Note: As per the Spark source code, countApprox() is marked 'Experimental'.
With timeout=800, you should have seen an approximate count in <1min.
Are you sure nothing else is causing this 30-minute slowdown? Share your code/code snippet to get more accurate input from other members.
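For what it's worth, a tiny Scala sketch of the default in action (assuming an existing RDD named rdd):

// Only the timeout is given, so the default confidence of 0.95 applies.
val approx = rdd.countApprox(timeout = 800)
val estimate = approx.initialValue
println(s"~${estimate.mean} at confidence ${estimate.confidence}, range [${estimate.low}, ${estimate.high}]")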
Upvotes: 7