Arnold Taremwa

Reputation: 43

Why the near constant execution time when increasing workers of Spark standalone

I'm told the recommended number of workers to set is one per core when using Spark. However, I'm not seeing any degradation in performance even when the number of workers is well above the number of cores on my computer. Why could this be?

from time import time
from pyspark import SparkContext

# Benchmark the same job with 1 to 9 local worker threads.
for j in range(1, 10):
    sc = SparkContext(master="local[%d]" % (j))  # local[n] = n worker threads
    t0 = time()
    for i in range(10):
        sc.parallelize([1, 2] * 1000000).reduce(lambda x, y: x + y)
    print("%2d executors, time=%4.3f" % (j, time() - t0))
    sc.stop()

#  1 executors, time=6.112
#  2 executors, time=5.202
#  3 executors, time=4.695
#  4 executors, time=5.090
#  5 executors, time=5.262
#  6 executors, time=5.156
#  7 executors, time=5.274
#  8 executors, time=5.376
#  9 executors, time=5.124

Hardware specs: (image not included)

Upvotes: 0

Views: 184

Answers (1)

Alper t. Turker

Reputation: 35249

You aren't measuring anything useful. With such a small amount of data, the processing time is negligible:

>>> %timeit for i in range(10): sum([1, 2] * 100000)
23.6 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The actual latency is caused almost entirely by the initialization time and, to a smaller extent, by scheduling overhead.
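To illustrate (the constants below are assumed for demonstration, not measured), the total runtime can be modeled as a fixed startup/scheduling overhead plus the parallelizable work; the curve flattens once the worker count passes the physical core count, which matches the shape of the numbers above:

```python
# Illustrative cost model: total = fixed overhead + work / effective parallelism.
# All constants here are assumptions, not measurements.
OVERHEAD = 4.5   # seconds of SparkContext startup + scheduling (assumed)
WORK = 1.5       # seconds of actual reduction work on one core (assumed)
CORES = 4        # assumed number of physical cores

def modeled_time(workers, overhead=OVERHEAD, work=WORK, cores=CORES):
    # Workers beyond the physical core count add no real parallelism.
    effective = min(workers, cores)
    return overhead + work / effective

for j in range(1, 10):
    print("%2d workers, modeled time=%4.3f" % (j, modeled_time(j)))
```

Because the fixed overhead dominates, doubling the workers barely moves the total, which is why the measured times look nearly constant.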

Also, the claim that

"the recommended number of workers to set is one per core when using Spark"

is not really correct. Many Spark jobs are I/O bound, and oversubscribing resources is often recommended.

The biggest concern is that when tasks become too small (as here), the cost of starting a task is larger than the cost of the processing it does. In practice your processor switches threads many times per second anyway; adding a few extra threads just won't make much of a difference.

Upvotes: 1
