user2359997
user2359997

Reputation: 601

Spark: rdd.countApprox() vs rdd.count()

Could someone please explain the difference between RDD countApprox() vs count() and also if possible can answer which is the fastest ? it would be of great help we have a requirement where count() is very slow takes about 30 min's ** ...tried countApprox() it was **fast for the first run (**About 1.2 min) and then slowed to 30 min's .....

this is how we used it not sure if it's the best way to use

rdd.countApprox(timeout=800, confidence=0.5)

Upvotes: 5

Views: 8018

Answers (3)

Jake
Jake

Reputation: 4650

Not my answer, but there is a very useful and important answer here.

In very short, countApprax.getFinalValue blocks even if this is longer than the timeout.

getInitialValue does not block and so you will get a response within the timeout.

BUT, as I learned from painful experience, even if you use getInitalValue the process will continue to final value.

If you are repeating this in a loop, the getFinalValue will be running for multiple RDDs long after you have retrieved the result from getInitialValue. This can then lead to OOM conditions and broadcast errors that are difficult to diagnose

Upvotes: 3

Priyanshu Kumar
Priyanshu Kumar

Reputation: 1

rdd.count() is an action, which is an eager operation.

This means that all the other transformations that you had written before that will start executing now because of Spark's lazy evaluation. So, essentially its not only Count() operation that's taking all the time but, all the other operations which were waiting to get executed.

Now coming back to the question of count() vs countApprox(). Count is just like doing a select count(*) from Table. countApprox can have a timeout and confidence level which returns back a result which is approximately correct and a number you can live with.

We should use countApprox when we are more interested in knowing an approximate number and save time for example in a streaming application. Count() should be used when you need the exact count for example to log something or for auditing.

Upvotes: 0

vmorusu
vmorusu

Reputation: 936

  • Count() - Returns you the number of elements in an RDD.
  • CountApprox - Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.

countApprox(timeout: Long, confidence: Double)

Default: confidence = 0.95

Note: As per the spark source code, support for countApprox is marked 'Experimental'.

With timeout=800, you should have seen an approximate count in <1min.

Are you sure nothing else is causing this slowdown of 30mins. Share your code/code-snippet to get more accurate inputs from other members.

Upvotes: 7

Related Questions