wylie

Reputation: 193

AWS Databricks cluster start failure

I am currently unable to spin up any clusters in our Databricks environment on AWS.

When I attempt to start an on-demand cluster, it remains in the "Pending" state for 20+ minutes, even on relatively small clusters that usually take 2-3 minutes to start up.

Similarly, all of my scheduled jobs are failing because their job clusters cannot start either. Here is a sample error message:

Run result unavailable: job failed with error message Unexpected failure while waiting for the cluster [cluster_name] to be ready. Cause Cluster [cluster_name] is unusable since the driver is unhealthy.

When I try to investigate the issue, the driver logs are completely empty. I have tried launching clusters on runtimes 9.1 and 10.4 and see the same issue on both.

Has anyone seen this before? Is this a Databricks issue or an AWS issue?

Upvotes: 2

Views: 1527

Answers (2)

smoot

Reputation: 322

This is a pretty vague error message, so here are two troubleshooting options I use that work most of the time:

  1. If it was shut down by a cloud provider API call: find the instance-id in the Databricks cluster's Event Log, then log in to AWS and go to CloudTrail > Event History, change the lookup attribute to "Event name", and search for "StopInstances". The matching event gives you the reasoning (see the sketch after this list).
  2. Otherwise, select the instance in the EC2 console and go to Monitor and troubleshoot > Get system log, which gives you everything the EC2 instance logged itself.
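
For reference, here is a minimal sketch of both steps using boto3 instead of the console. It assumes your AWS credentials and region are already configured; the instance-id value is a hypothetical placeholder, so substitute the one from your cluster's Event Log.

    import base64

    import boto3

    INSTANCE_ID = "i-0123456789abcdef0"  # placeholder; use the instance-id from the Event Log

    # Step 1: find recent StopInstances calls in CloudTrail to see who or
    # what stopped the instance.
    cloudtrail = boto3.client("cloudtrail")
    response = cloudtrail.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": "StopInstances"}
        ],
        MaxResults=20,
    )
    for event in response["Events"]:
        # CloudTrailEvent is a JSON string with the full event detail,
        # including the caller identity and any error code.
        print(event["EventTime"], event.get("Username", "?"))
        print(event["CloudTrailEvent"])

    # Step 2: pull the EC2 system log (same data as the console's
    # Monitor and troubleshoot > Get system log).
    ec2 = boto3.client("ec2")
    console = ec2.get_console_output(InstanceId=INSTANCE_ID)
    output = console.get("Output")
    if output:
        # The API returns the log base64-encoded.
        print(base64.b64decode(output).decode("utf-8", errors="replace"))
    else:
        print("No console output available yet")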

Upvotes: 0

Robert Long

Reputation: 6812

"Has anyone seen this before? Is this a Databricks issue or an AWS issue?"

Yes, I have seen this before. In almost all cases it was a cloud provider problem that resolved itself within a few hours. I have also seen it after a networking change where a new VPC was set up. Unless your networking has changed recently, if the problem persists I would raise a support ticket with Databricks.

Upvotes: 1
