Reputation: 43
I'm told the recommended number of workers to set is one per core when using Spark. However, I don't see any degradation in performance when the number of workers is well above the number of cores on my computer. Why could this be?
from time import time
from pyspark import SparkContext

for j in range(1, 10):
    sc = SparkContext(master="local[%d]" % (j))
    t0 = time()
    for i in range(10):
        sc.parallelize([1, 2] * 1000000).reduce(lambda x, y: x + y)
    print("%2d executors, time=%4.3f" % (j, time() - t0))
    sc.stop()
# 1 executors time=6.112
# 2 executors time=5.202
# 3 executors time=4.695
# 4 executors time=5.090
# 5 executors time=5.262
# 6 executors time=5.156
# 7 executors time=5.274
# 8 executors time=5.376
# 9 executors time=5.124
Hardware specs:
Upvotes: 0
Views: 184
Reputation: 35249
You aren't measuring anything useful. With such a small amount of data, the processing time is negligible:
>>> %timeit for i in range(10): sum([1, 2] * 100000)
23.6 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The actual latency is caused almost completely by the initialization time and, to a smaller extent, by the scheduling overhead.
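As a rough illustration (a sketch, not part of the original benchmark; the master setting local[4] and the warm-up step are illustrative choices), creating the SparkContext once outside the timed loop and caching the RDD removes most of the fixed startup cost from the measurement:

from time import time
from pyspark import SparkContext

# Create the context once, outside the timing, so its startup cost is excluded.
sc = SparkContext(master="local[4]")

rdd = sc.parallelize([1, 2] * 1000000).cache()
rdd.count()  # warm-up job: materializes the cached RDD before timing

t0 = time()
for _ in range(10):
    rdd.reduce(lambda x, y: x + y)
print("time=%4.3f" % (time() - t0))

sc.stop()

With the context reused and the data cached, the remaining time reflects scheduling plus actual processing rather than initialization.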
Also, the claim that
the recommended number of workers to set is one per core when using Spark
is not really correct. Many Spark jobs are IO bound, and oversubscribing resources is recommended.
The biggest concern is that if tasks become too small (like here), the cost of starting a task is larger than the cost of processing it. In practice, your processor switches threads many times per second; adding a few additional threads just won't make much of a difference.
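A minimal sketch of the same effect, assuming a local context and illustrative partition counts (4 vs. 400 are arbitrary values, not from the original answer), uses the numSlices argument of parallelize to control how much work each task gets:

from time import time
from pyspark import SparkContext

sc = SparkContext(master="local[4]")
data = [1, 2] * 1000000

# Fewer, larger tasks: each task carries enough work to amortize its
# scheduling and startup cost.
t0 = time()
sc.parallelize(data, numSlices=4).reduce(lambda x, y: x + y)
print("4 partitions:   %4.3f s" % (time() - t0))

# Many tiny tasks: per-task overhead dominates the actual processing.
t0 = time()
sc.parallelize(data, numSlices=400).reduce(lambda x, y: x + y)
print("400 partitions: %4.3f s" % (time() - t0))

sc.stop()

When each task is this cheap, the per-task overhead, not the thread count, dominates the total runtime, which is why adding more workers than cores barely changes the numbers above.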
Upvotes: 1