Benoît Carlier

Reputation: 73

pyspark countApprox() does not seem to be different from count()

I'm having trouble with pyspark's count() method, which is way too slow for my program. I found the countApprox(timeout, confidence) method, but it doesn't speed up the process.

Doing a bit of research, I found that I should maybe use rdd.countApprox.initialValue, but it doesn't seem to work: in pyspark the result of countApprox is an int and not a PartialResult object (I guess it is different in Scala or Java).
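
For reference, this is roughly what countApprox boils down to when I read through pyspark's rdd.py (paraphrased from the version I'm running, so it may differ in other versions):

def countApprox(self, timeout, confidence=0.95):
    # paraphrase of pyspark internals, not my own code:
    # fully count every partition on the Python side...
    drdd = self.mapPartitions(lambda it: [float(sum(1 for i in it))])
    # ...then approximate-sum the per-partition counts and cast to int
    return int(drdd.sumApprox(timeout, confidence))

If that is accurate, the int(...) cast is why there is no PartialResult to poll from Python.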

Does anyone know how to make countApprox work in pyspark?

My test code to compare:

a = sc.parallelize(range(1000000),10)

import time
t = time.time()
print("there are ",a.count()," rows")
print(time.time()-t)

gives:

there are  1000000  rows
3.4864296913146973

but

b = sc.parallelize(range(1000000),10)

import time
t = time.time()
print("there are ",b.countApprox(10,0.1)," rows")
print(time.time()-t)

gives:

there are  1000000  rows
3.338970422744751

Which is pretty much the same execution time...

Upvotes: 3

Views: 842

Answers (1)

David Greenshtein

Reputation: 538

countApprox works faster than count; it takes a timeout and a confidence parameter. I suppose you will see the runtime difference on big datasets.
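
If you specifically need the Scala-style PartialResult the question mentions, one possible workaround is to drop down to the JVM-side RDD. This is an untested sketch, and it relies on _to_java_object_rdd, a private PySpark helper that may change between versions:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()
b = sc.parallelize(range(1000000), 10)

# Untested sketch: go through the JVM RDD to get a Scala PartialResult.
jrdd = b._to_java_object_rdd()              # JavaRDD holding the same elements
partial = jrdd.rdd().countApprox(100, 0.5)  # Scala countApprox; timeout is in milliseconds
bounded = partial.initialValue()            # BoundedDouble available once the timeout expires
print(bounded.mean(), bounded.low(), bounded.high())

Whether this is actually faster I can't promise; converting the elements to Java objects has its own cost, so it is worth trying on your real dataset.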

Upvotes: 1
