Reputation: 193
I've inherited some code from an ex-coworker where he started using futures (in Scala) to process some data in Databricks.
I split it into chunks that complete in a similar time period. However there is no output, and I know they aren't using onSuccess or Await or anything.
The thing is, the code finishes running (doesn't return output) but the block in Databricks keeps executing until the thread.sleep() part.
I'm new to Scala and futures and am not sure how I can just exit the notebook once all the futures finish running (should i just use dbutils.notebook.exit() after the future blocks?)
Code is below:
import scala.concurrent.{Future, blocking, Await}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import com.databricks.WorkflowException
val numNotebooksInParallel = 15
// If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once.
// This code limits the number of parallel notebooks.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
val ctx = dbutils.notebook.getContext()
// The simplest interface we can have but doesn't
// have protection for submitting to many notebooks in parallel at once
println("starting parallel jobs... hang tight")
Future {
process("pro","bseg")
process("prc","bkpf")
process("prc","bseg")
process("pr4","bkpf")
process("pr4","bseg")
println("done with future1")
}
Future {
process("pr5","bkpf")
process("pr5","bseg")
process("pri","bkpf")
process("pri","bseg")
process("pr9","bkpf")
println("done with future2")
}
Future {
process("pr9","bseg")
process("prl","bkpf")
process("prl","bseg")
process("pro","bkpf")
println("done with future3")
}
println("finished futures - yay! :)")
Thread.sleep(5*60*60*1000)
println("thread timed out after 5 hrs... hope it all finished.")
Upvotes: 0
Views: 344
Reputation: 20561
One would typically save the futures as values:
val futs = Seq(
Future {
process("pro","bseg")
// and so on
},
// then the other futures
)
and then operate on the futures:
import scala.concurrent.Await
import scala.concurrent.duration._
Await.result(Future.sequence(futs), 5.hours)
Future.sequence
will stop at the first one that fails or once they've all succeeded. If you want them all to run even if one fails, you could do something like
Await.result(
futs.foldLeft(Future.unit) { (_, f) =>
f.recover {
case _ => ()
}
},
5.hours
)
Upvotes: 1