Reputation: 611
I've processed a large dataset in Spark and stored the results in HDFS.
However, the saveAsTextFile step seems slower than I'd expect.
So I wonder if there is a way to improve its performance.
My original code (which runs slower than expected):
val data = sc.textFile("data", 200)
data.
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  saveAsTextFile("output")
When I add coalesce(1), the speed improves dramatically:
val data = sc.textFile("data", 200)
data.
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  coalesce(1).
  saveAsTextFile("output")
Upvotes: 0
Views: 1644
Reputation: 4333
I am guessing your job is running slowly because you are asking for 200 partitions of your input. When you write the output to HDFS, it writes 200 (probably small) files, and you notice the speed-up when you coalesce down to 1.
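If you want to confirm that, you can check how many partitions the final RDD has before the write; saveAsTextFile writes one part-file per partition. Here's a quick sketch against the pipeline from your question (the val name counts is just for illustration):
val counts = data.
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _)

// Typically prints 200 here, since reduceByKey reuses the upstream
// partition count (unless spark.default.parallelism overrides it),
// and each of those partitions becomes one file in "output".
println(counts.partitions.length)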
I would suggest removing the 200 partitions from textFile and letting Spark pick the default parallelism:
val data = sc.textFile(inputDir) // no partitions specified
You may still want to keep an eye on the file sizes written out at the end of the job, though. HDFS performs best when file sizes are close to the block size (128 MB by default on recent Hadoop versions).
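If you do want to aim for that, one rough approach (just a sketch; totalOutputBytes is a hypothetical estimate you'd have to supply yourself) is to derive the number of output files from the expected output size:
val blockSize = 128L * 1024 * 1024               // 128 MB HDFS block size
val totalOutputBytes = 2L * 1024 * 1024 * 1024   // hypothetical ~2 GB of output
val numFiles = math.max(1, (totalOutputBytes / blockSize).toInt)

data.
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _).
  coalesce(numFiles).    // each partition becomes roughly one block-sized file
  saveAsTextFile("output")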
Another reason more partitions can be slower is that Spark does some setup/teardown work per partition. There is a sweet spot when setting those numbers: take a look at your Spark web UI, and if there's 100 ms of setup/teardown for every 5 ms of real work, for instance, you want fewer partitions.
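One way to pick that number without a separate coalesce step (again just a sketch, and 16 is an arbitrary example value) is to pass the partition count directly to the shuffle:
data.
  flatMap(_.split(" ")).
  map(word => (word, 1)).
  reduceByKey(_ + _, 16).   // both the reduce and the write use 16 partitions
  saveAsTextFile("output")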
I always start with Spark's defaults, though, and tweak from there as needed.
Upvotes: 2