Captain

Reputation: 193

How to change number of parallel tasks in pyspark

How do I change the number of parallel tasks in PySpark?

I mean, how do I change the number of virtual map tasks that run on my PC? Actually, I want to sketch a speed-up chart against the number of map functions.

sample code:

words = sc.parallelize(["scala","java","hadoop"])\
           .map(lambda word: (word, 1)) \
           .reduceByKey(lambda a, b: a + b)

If you understand my purpose but I asked it in the wrong way, I would appreciate it if you corrected me.

Thanks

Upvotes: 0

Views: 1865

Answers (1)

user8966541

Reputation: 11

For this toy example, the number of parallel tasks will depend on:

  • The number of partitions of the input RDD - set by spark.default.parallelism if not configured otherwise (see the configuration sketch after this list).
  • The number of threads assigned to the local master (which might be superseded by the above).
  • The physical and permission-based capabilities of the system.
  • The statistical properties of the dataset.
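
A minimal sketch of how those knobs could be set from PySpark; the master string, app name, and the value 4 are arbitrary assumptions, not something from the question:

from pyspark import SparkConf, SparkContext

# Run Spark locally with 4 worker threads; change the number to vary parallelism.
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("parallelism-demo")
        .set("spark.default.parallelism", "4"))
sc = SparkContext(conf=conf)

# numSlices explicitly controls how many partitions the input RDD gets,
# and therefore how many tasks each stage can run in parallel.
words = sc.parallelize(["scala", "java", "hadoop"], numSlices=4) \
          .map(lambda word: (word, 1)) \
          .reduceByKey(lambda a, b: a + b)

print(words.getNumPartitions())  # number of partitions, hence tasks per stage
print(words.collect())
sc.stop()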

However, Spark is not a lightweight parallelization tool. For that, there are low-overhead alternatives such as threading and multiprocessing, higher-level components built on top of these (like joblib or RxPy), and native extensions (to escape the GIL when threading).
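
As a rough illustration, the same word count fits in the standard library alone; the chunking scheme, the number of processes, and the data below are purely hypothetical:

from collections import Counter
from multiprocessing import Pool

def count_chunk(chunk):
    # Count word occurrences in one slice of the data.
    return Counter(chunk)

if __name__ == "__main__":
    words = ["scala", "java", "hadoop"] * 1000
    chunks = [words[i::4] for i in range(4)]   # 4 roughly equal slices
    with Pool(processes=4) as pool:            # 4 worker processes
        partial_counts = pool.map(count_chunk, chunks)
    total = sum(partial_counts, Counter())     # merge the partial counts
    print(total)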

Spark itself is heavyweight, with huge coordination and communication overhead, and as stated by desernaut it is hardly justified for anything other than testing when limited to a single node. In fact, higher parallelism can make things much worse, as the timing sketch below may show.
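
If the goal is still a speed-up chart, a sketch along these lines could time the same job under different local[N] settings; the thread counts, data size, and numSlices values are arbitrary assumptions:

import time
from pyspark import SparkConf, SparkContext

data = ["scala", "java", "hadoop"] * 100000

for n in (1, 2, 4, 8):
    # One SparkContext at a time: create it for this thread count, stop it after.
    sc = SparkContext(conf=SparkConf()
                      .setMaster("local[%d]" % n)
                      .setAppName("speedup-%d" % n))
    start = time.time()
    (sc.parallelize(data, numSlices=n)
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
       .collect())
    print(n, time.time() - start)  # one data point for the speed-up chart
    sc.stop()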

Upvotes: 1
