Reputation: 193
How to change the number of parallel tasks in pyspark?
I mean, how can I change the number of parallel map tasks that run on my PC? Actually, I want to sketch a speed-up chart against the number of map tasks.
sample code:
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")  # in the pyspark shell, sc is already provided
words = sc.parallelize(["scala", "java", "hadoop"]) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
If you understand my purpose but I asked it the wrong way, I would appreciate a correction.
Thanks
Upvotes: 0
Views: 1865
Reputation: 11
For this toy example the number of parallel tasks will depend on:
- the partitioning of the input rdd - set by spark.default.parallelism if not configured otherwise;
- the number of threads assigned to the local master (might be superseded by the above).
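A minimal sketch of how these settings can be changed (the master local[4], the value 4, and the app name "toy-app" are only placeholders):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("local[4]")                    # number of worker threads on this machine
        .set("spark.default.parallelism", "4"))   # default partition count if none is given
sc = SparkContext(conf=conf, appName="toy-app")

# numSlices overrides the default for this particular RDD
words = sc.parallelize(["scala", "java", "hadoop"], numSlices=4)
print(words.getNumPartitions())                   # number of partitions = tasks per stage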
However, Spark is not a lightweight parallelization framework - for this we have low-overhead alternatives like threading and multiprocessing, higher-level components built on top of these (like joblib or RxPy), and native extensions (to escape the GIL with threading).
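For comparison, a rough sketch of the same toy word count using the standard library's multiprocessing (the pool size of 4 is arbitrary):

from collections import Counter
from multiprocessing import Pool

def count_one(word):
    # mirrors the map step of the Spark example: emit a (word, 1) pair
    return (word, 1)

if __name__ == "__main__":
    words = ["scala", "java", "hadoop"]
    with Pool(processes=4) as pool:            # 4 worker processes
        pairs = pool.map(count_one, words)     # parallel "map" phase
    counts = Counter()                         # local equivalent of reduceByKey
    for word, n in pairs:
        counts[word] += n
    print(dict(counts))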
Spark itself is heavyweight, with huge coordination and communication overhead, and as stated by desernaut it is hardly justified for anything other than testing when limited to a single node. In fact, higher parallelism can make things much worse.
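If the goal is the speed-up chart from the question, one possible sketch (not a rigorous benchmark; the dataset size and the partition counts tried are arbitrary) is to time the same job at several partition counts and plot the results:

import time
from pyspark import SparkContext

sc = SparkContext("local[*]", "speedup-sketch")
data = ["scala", "java", "hadoop"] * 100000    # placeholder dataset

for n in [1, 2, 4, 8]:                         # candidate numbers of parallel tasks
    rdd = sc.parallelize(data, numSlices=n)
    start = time.time()
    rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).collect()
    print(n, time.time() - start)              # plot these points to see the (lack of) speed-up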
Upvotes: 1