Anthony Davies

Reputation: 81

Databricks - Failure Starting REPL

I am using a databricks cluster to run some ETLs.

During the night there is a peak in executions; no libraries are installed while they run.

The Spark version is 3.2.0, the Scala version is 2.12, and the Databricks Runtime is 10.2.

During the execution peaks, a failure to start the Python REPL sometimes causes notebooks that are vital to the process to fail.

The error can be seen in the first image.

This error has been happening since last month. I have increased the max executors from 2 to 3, but the error still occurs some days. The cluster information can be seen in the second image. The peak generally runs 50 ETLs at the same time.

The failure happened between 10:05 and 10:15.

Worker information can be found in the third image.

First image: Error

Second image: Memory and CPU info

Third image: Cluster Worker Info

Upvotes: 4

Views: 12477

Answers (1)

Anthony Davies

Reputation: 81

Actually, the real issue was the driver.

The max workers were set to 3, but sometimes, during the execution peak, the number of workers stayed at 2. Exploring the issue, I identified that the whole problem happened during command interpretation, not during task processing. The driver struggled to interpret all the code and organize the tasks during this peak.

The easy solution would have been to increase the machine's memory and cores, but this would have cost an extra US$ 10,000 per year for the driver alone, plus US$ 10,000 per year per worker (20 hours a day, every day). Because of the cost, I decided to develop something similar to a job pool, which limits the number of executions sent to the driver and spreads them out over time.

This solution solved the problem and saved up to US$ 40,000 over the next 365 days of operation.

Upvotes: 4
