theMadKing

Reputation: 2074

PySpark partitionBy, repartition, or nothing?

So what I've done is

rdd.flatMap(lambda x: enumerate(x))

This makes keys 0-49 for my data. Then I decided to do:

rdd.flatMap(lambda x: enumerate(x)).partitionBy(50)

I noticed something odd: a 10GB file took 46 seconds to run my calculations, while a 50GB file took 10 minutes 31 seconds. I checked the file and for some reason it was stored in only 4 blocks.

So I changed the load to:

sc.textFile("file", 100)

I removed the partitionBy and the 50GB file dropped to about 1 minute. Does it still make sense to repartition the data after it loads? Maybe by key?
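For context, here is roughly what the full pipeline looks like now (the split step is a placeholder for my actual parsing, which yields 50 values per line):

from pyspark import SparkContext

sc = SparkContext(appName="partition-question")

# Load with an explicit minimum partition count instead of the default,
# which followed the file's 4 blocks and gave almost no parallelism.
rdd = sc.textFile("file", 100)

# Placeholder parse: each line becomes a list of 50 values, so
# enumerate() emits (0, v0) ... (49, v49) for every record.
keyed = rdd.map(lambda line: line.split(",")).flatMap(lambda x: enumerate(x))

# The step I removed: hash-partitioning on the 0-49 keys. With only 50
# distinct keys this also caps parallelism at 50 tasks.
# keyed = keyed.partitionBy(50)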

Upvotes: 3

Views: 4274

Answers (1)

Nikita

Reputation: 4515

If I understood your question correctly, you are asking when you need an additional repartition. First, remember that repartition is an expensive operation, so use it wisely. Second, there is no rigorous answer; it comes with experience. But some common cases are (all three are sketched in code after the list):

  1. You can try calling repartition on your data before a join, leftOuterJoin, cogroup, etc. Sometimes it can speed up the computation.

  2. Your flatMap produces more "heavy-weighted" data and you hit java.lang.OutOfMemoryError: Java heap space. Then you certainly should make your partitions smaller so the post-flatMap data fits in memory.

  3. You load data into a database, MongoDB, Elasticsearch, etc. You call repartition on your data, then inside a foreachPartition block you bulk-insert the whole partition into the database, so the size of these chunks should be reasonable.
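A rough sketch of all three cases; the partition counts are illustrative, not tuned, and write_batch is a hypothetical stand-in for your database client's bulk-insert call:

from pyspark import SparkContext

sc = SparkContext(appName="repartition-cases")

# Case 1: partition both sides of a join with the same partitioner up
# front, so the join itself triggers no extra shuffle.
left = sc.parallelize([(i, i * 2) for i in range(1000)]).partitionBy(100)
right = sc.parallelize([(i, i * 3) for i in range(1000)]).partitionBy(100)
joined = left.join(right)

# Case 2: a flatMap that inflates every record; repartition into more,
# smaller partitions afterwards so each task fits in the executor heap.
inflated = sc.parallelize(range(100), 4) \
    .flatMap(lambda x: [(x, "payload" * 1000)] * 50) \
    .repartition(400)

# Case 3: pick a partition count that matches a sensible bulk-insert
# batch size, then write each partition in one round trip.
def write_batch(rows_iter):
    batch = list(rows_iter)
    # e.g. collection.insert_many(batch) for MongoDB (hypothetical client call)
    pass

inflated.repartition(50).foreachPartition(write_batch)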

Upvotes: 3
