KLA

Reputation: 31

Databricks notebook crashes on memory-heavy job

I am running a few operations to aggregate a large amount of data (about 600 GB) on Azure Databricks. Recently the notebook started crashing, and Databricks returns the error below. The same code worked before on a smaller 6-node cluster. After upgrading to 12 nodes I started getting this error, and I suspect it is a configuration problem.

Any help would be appreciated. I use the default Spark configuration with the number of partitions set to 200, and I have 88 executors on my nodes.


Thanks
Internal error, sorry. Attach your notebook to a different cluster or restart the current cluster.
java.lang.RuntimeException: abort: DriverClient destroyed
    at com.databricks.backend.daemon.driver.DriverClient.$anonfun$poll$3(DriverClient.scala:381)
    at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
    at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
    at com.databricks.threading.NamedExecutor$$anon$2.$anonfun$run$1(NamedExecutor.scala:335)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at com.databricks.logging.UsageLogging.$anonfun$withAttributionContext$1(UsageLogging.scala:238)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
    at com.databricks.logging.UsageLogging.withAttributionContext(UsageLogging.scala:233)
    at com.databricks.logging.UsageLogging.withAttributionContext$(UsageLogging.scala:230)
    at com.databricks.threading.NamedExecutor.withAttributionContext(NamedExecutor.scala:265)
    at com.databricks.threading.NamedExecutor$$anon$2.run(NamedExecutor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Upvotes: 7

Views: 11470

Answers (2)

johnnyasd12

Reputation: 775

Just for other people facing a similar issue.

In my case, the same error sometimes occurred when there were multiple Spark actions in one cell of a Databricks notebook.

Surprisingly, splitting the cell just before the code where the error occurred, or simply inserting time.sleep(5) there, worked for me. However, I'm not sure why it worked.

For example:

df1.count() # some Spark action

# split the cell or insert `time.sleep(5)` here

pipeline.fit(df1) # another Spark action where the error happened
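
If you would rather keep everything in one cell, the pause can be wrapped in a small helper. This is only a sketch of the workaround described above, not a fix for the underlying cause, which I still don't know. The name `run_with_pauses` is made up for illustration and is not part of Spark or Databricks; `df1` and `pipeline` are the objects from the example.

```python
import time


def run_with_pauses(actions, pause_seconds=5):
    """Run zero-argument callables in order, sleeping between consecutive ones.

    Hypothetical helper: it only spaces the calls out in time,
    mirroring the manual time.sleep(5) workaround.
    """
    results = []
    for index, action in enumerate(actions):
        if index > 0:
            # Pause before every action except the first one.
            time.sleep(pause_seconds)
        results.append(action())
    return results


# In a notebook cell this might be used as follows
# (df1 and pipeline are assumed to exist already):
# row_count, model = run_with_pauses([df1.count, lambda: pipeline.fit(df1)])
```

Whether 5 seconds is the right pause is a guess; it is simply the value that happened to work for me.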

Upvotes: 0

gip

Reputation: 103

I'm not sure about the cost implications, but how about enabling the autoscaling option on the cluster and increasing Max Workers? You could also try changing the Worker Type to one with more resources.
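
If you manage the cluster through the Clusters API rather than the UI, the same two settings correspond to the `autoscale` block and `node_type_id` in the cluster spec. Here is a rough sketch that builds just that fragment as a Python dict. The worker counts and the node type are placeholder values I picked for illustration, not recommendations, and a real request needs more fields than shown here.

```python
import json


def autoscale_fragment(min_workers, max_workers, node_type_id):
    """Build the autoscaling part of a cluster spec (illustrative helper)."""
    if min_workers < 0 or max_workers < min_workers:
        raise ValueError("expected 0 <= min_workers <= max_workers")
    return {
        # Worker Type in the UI
        "node_type_id": node_type_id,
        # Min Workers / Max Workers in the UI
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
    }


# Placeholder values: adjust to your workload and budget.
fragment = autoscale_fragment(6, 16, "Standard_DS4_v2")
print(json.dumps(fragment, indent=2))
```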


Upvotes: 2
