BAR

Reputation: 17161

Google Dataproc Jobs Never Cancel, Stop, or Terminate

I have been using Google Dataproc for a few weeks now and since I started I had a problem with canceling and stopping jobs.

It seems like there must be some server other than those created on cluster setup, that keeps track of and supervises jobs.

I have never had a process that does its job without error actually stop when I hit stop in the dev console. The spinner just keeps spinning and spinning.

Cluster restart or stop does nothing, even if stopped for hours.

Only when the cluster is entirely deleted will the jobs disappear... (But wait there's more!) If you create a new cluster with the same settings, before the previous cluster's jobs have been deleted, the old jobs will start on the new cluster!!!

I have seen jobs that terminate on their own due to OOM errors restart themselves after cluster restart! (with no coding for this sort of fault tolerance on my side)

How can I forcefully stop Dataproc jobs? (gcloud beta dataproc jobs kill does not work)

Does anyone know what is going on with these seemingly related issues?

Is there a special way to shutdown a Spark job to avoid these issues?

Upvotes: 3

Views: 3293

Answers (1)

James

Reputation: 2331

Jobs keep running

In some cases, errors are not successfully reported to the Cloud Dataproc service. Thus, if a job fails, it can appear to run forever even though it has (probably) already failed on the back end. This should be fixed by a soon-to-be-released version of Dataproc in the next 1-2 weeks.

Job starts after restart

This would be unintended and undesirable. We have tried to replicate this issue and cannot. If anyone can replicate this reliably, we'd like to know so we can fix it! This may be related to the issue above, where a job has failed but still appears to be running, even after a cluster restarts.

Best way to shutdown

Ideally, the best way to shut down a Cloud Dataproc cluster is to terminate the cluster and start a new one. If that is problematic, you can try a bulk restart of the Compute Engine VMs; it will be much easier to create a new cluster, however.
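As a sketch, the delete-and-recreate path looks like the following. The job ID, cluster name, and zone are placeholders; the `jobs kill` attempt may hang if the job's state is stuck, which is exactly the symptom described in the question.

```shell
# Try to cancel the job first (may not work if the job state is stuck):
gcloud beta dataproc jobs kill my-job-id

# If the job never leaves the running state, delete the cluster outright:
gcloud beta dataproc clusters delete my-cluster

# Wait until the old cluster and its jobs are fully gone before recreating
# a cluster with the same name, to avoid old jobs starting on the new cluster:
gcloud beta dataproc clusters create my-cluster --zone us-central1-a
```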

Upvotes: 1
