Reputation: 31
I am running a few operations to aggregate a large quantity of data (about 600 GB) on Azure Databricks. I noticed recently that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster. After upgrading it to 12 nodes, I started getting this error, and I suspect it is a configuration problem.
Any help please. I use the default Spark configuration with the shuffle partition number set to 200, and I have 88 executors across my nodes.
Thanks
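For reference, this is the setting I mean by the default partition number; a minimal sketch of how I read it and how it could be changed (the value 176 below is only an illustration based on 2 partitions per executor, not something I have verified helps):

# Check and adjust the shuffle partition count (defaults to 200).
current = spark.conf.get("spark.sql.shuffle.partitions")
print(f"spark.sql.shuffle.partitions = {current}")  # prints 200 with the defaults
# Illustrative only: 2 x 88 executors; I have not tested this value.
spark.conf.set("spark.sql.shuffle.partitions", "176")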
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll$3(DriverClient.scala:381)
at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
at com.databricks.threading.NamedExecutor$$anon$2.$anonfun$run$1(NamedExecutor.scala:335)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
at com.databricks.threading.NamedExecutor$$anon$2.run(NamedExecutor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Upvotes: 7
Views: 11470
Reputation: 775
Just for other people facing a similar issue.
In my situation, the same error sometimes happened when there were multiple Spark actions in one cell of a Databricks notebook.
Surprisingly, splitting the cell just before the code where the error occurred, or simply inserting time.sleep(5)
there, worked for me. However, I'm not sure why it worked...
For example:
import time

df1.count()    # some Spark action
time.sleep(5)  # split the cell here instead, or pause briefly between the two actions
pipeline.fit(df1)  # another Spark action, where the error happened
Upvotes: 0
Reputation: 103
I'm not sure about the cost implications, but how about enabling the autoscaling option on the cluster and bumping up Max Workers? You can also try changing the Worker Type to one with more resources.
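If you prefer to script that change rather than use the cluster UI, here is a minimal sketch against the Databricks Clusters API 2.0; the workspace URL, token, cluster ID, runtime version, node type and worker counts below are all placeholders or examples, not values from the question:

import requests

HOST = "https://<your-workspace>.azuredatabricks.net"   # placeholder workspace URL
TOKEN = "<personal-access-token>"                        # placeholder access token

payload = {
    "cluster_id": "<cluster-id>",                        # placeholder cluster ID
    "spark_version": "7.3.x-scala2.12",                  # example runtime; keep your current one
    "node_type_id": "Standard_DS4_v2",                   # example of a larger Azure worker type
    "autoscale": {"min_workers": 6, "max_workers": 12},  # example autoscaling bounds
}

resp = requests.post(f"{HOST}/api/2.0/clusters/edit",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=payload)
resp.raise_for_status()  # the edit restarts the cluster with the new settings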
Upvotes: 2