Reputation: 73
I'm having trouble with the count()
method in PySpark, which is way too slow for my program. I found out about countApprox(timeout, confidence),
but it doesn't speed up the process.
What I found from a bit of research is that I should maybe use rdd.countApprox.initialValue,
but that doesn't seem to work: in PySpark the result of countApprox
is an int, not a PartialResult
object (I guess it is different in Scala or Java).
Does anyone know how to make countApprox
work in PySpark?
My test code for comparison:
a = sc.parallelize(range(1000000),10)
import time
t = time.time()
print("there are ",a.count()," rows")
print(time.time()-t)
gives:
there are 1000000 rows
3.4864296913146973
but
b = sc.parallelize(range(1000000),10)
import time
t = time.time()
print("there are ",b.countApprox(10,0.1)," rows")
print(time.time()-t)
gives:
there are 1000000 rows
3.338970422744751
That's pretty much the same execution time...
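To be explicit about the PartialResult part: the value countApprox returns in PySpark really is just a plain int, so there is nothing to call initialValue on. A quick check, reusing b from above:
r = b.countApprox(10, 0.1)
# r is a plain int, not a PartialResult, so r.initialValue would
# just raise an AttributeError
print(type(r))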
Upvotes: 3
Views: 842
Reputation: 538
countApprox works faster than count and takes timeout and confidence parameters. I suppose you will see the runtime difference on big datasets.
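A rough way to check that is to time both calls on an RDD that is big enough for the full count to take a while; the sizes and the short timeout below are just illustrative guesses, not measured numbers:
import time

# Generate the rows on the executors so the driver does not have to ship
# 100 million elements: 1,000 seeds, each expanded to 100,000 rows.
big = sc.parallelize(range(1000), 100).flatMap(lambda x: range(100000))

t = time.time()
# short budget (the timeout is in milliseconds) and a loose confidence
print("approx:", big.countApprox(timeout=100, confidence=0.9), time.time() - t)

t = time.time()
print("exact :", big.count(), time.time() - t)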
Upvotes: 1