Reputation: 6915
I have a large dataset (around 5 TB) that I am trying to process with Apache Spark. I have noticed that when the job starts, it reads data very quickly, and the first stage of the job (a map transformation) initially runs fast.
However, after processing around 500 GB of data, that map transformation slows down, and some of the tasks take several minutes or even hours to complete.
I am using 10 machines, each with 122 GB of RAM and 16 CPUs, and I am allocating all of those resources to the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?
I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.
Upvotes: 5
Views: 7367
Reputation: 73366
It seems that the stage completes faster on some nodes than on others. Based on that observation, here is what I would try:
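One possible way to test that observation (an assumption on my part, not something confirmed by the question) is to count the records in each partition and see whether a few partitions are much larger than the rest. Below is a minimal sketch, assuming PySpark and an existing RDD; the helper names `partition_counts` and `skew_ratio` are hypothetical and only `mapPartitions` and `collect` come from the Spark API.

```python
from statistics import median


def partition_counts(rdd):
    """Return the number of records in each partition of a PySpark RDD."""
    # mapPartitions hands each partition over as an iterator, so records are
    # counted one by one without building a list of the whole partition.
    return rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()


def skew_ratio(counts):
    """Ratio of the largest partition to the median one (1.0 means even)."""
    if not counts:
        raise ValueError("no partitions to summarize")
    mid = median(counts)
    # If the median partition is empty, any non-empty partition is
    # infinitely larger than it; report that as infinity.
    if mid == 0:
        return float("inf") if max(counts) > 0 else 1.0
    return max(counts) / mid
```

Calling `skew_ratio(partition_counts(rdd))` on the input of the slow stage gives a single number to look at: a value close to 1 suggests the records are spread evenly, while a large value suggests a few tasks are doing most of the work.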
Upvotes: 6
Reputation: 943
Without more information, it seems that at some point in the computation your data gets spilled to disk because there is no more room in memory. This is just a guess; you should check your Spark UI.
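A quick back-of-the-envelope calculation shows why running out of memory is plausible here. The sketch below uses the figures from the question (5 TB of data, 10 machines, 122 GB and 16 CPUs each, reading the 122 GB as RAM); the 128 MB target partition size is my own assumption, a common rule of thumb rather than a value from the post.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

# Figures taken from the question.
DATASET_BYTES = 5 * TB
NODES = 10
RAM_PER_NODE_BYTES = 122 * GB
CORES_PER_NODE = 16

# Assumption: aim for roughly 128 MB per partition (not a value from the post).
TARGET_PARTITION_BYTES = 128 * MB


def cluster_estimate(dataset_bytes, nodes, ram_per_node, cores_per_node,
                     target_partition):
    """Compare dataset size with cluster memory and suggest a partition count."""
    total_ram = nodes * ram_per_node
    total_cores = nodes * cores_per_node
    partitions = math.ceil(dataset_bytes / target_partition)
    return {
        "data_to_ram_ratio": dataset_bytes / total_ram,
        "partitions": partitions,
        "partitions_per_core": partitions / total_cores,
    }


estimate = cluster_estimate(DATASET_BYTES, NODES, RAM_PER_NODE_BYTES,
                            CORES_PER_NODE, TARGET_PARTITION_BYTES)
print(estimate)
```

With these numbers the dataset is about 4.2 times larger than the cluster's total RAM (and Spark can only use part of that RAM), so anything that tries to hold the full dataset in memory will have to spill. At 128 MB per partition the job would have 40960 partitions, or 256 per core.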
Upvotes: 0