Reputation: 51
We have an issue with Databricks notebooks. One of our notebook cells appears to hang, even though the driver logs show that the cell has finished executing. Does anyone know why the notebook cell keeps hanging and never completes? Details are below.
Situation
Observations
Code
import mlflow
import mlflow.pytorch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

# logger, model, train_loader, val_loader, experiment_id, param1 and param2 are defined in earlier notebook cells
trainer = Trainer(gpus=-1, num_sanity_val_steps=0, logger=logger, callbacks=[EarlyStopping(monitor="test_loss", patience=2, mode="min", verbose=True)])

with mlflow.start_run(experiment_id=experiment_id) as run:
    trainer.fit(model, train_loader, val_loader)
    mlflow.log_param("param1", param1)
    mlflow.log_param("param2", param2)
    mlflow.pytorch.log_model(model._image_model, artifact_path="model", registered_model_name="image_model")
    mlflow.pytorch.log_state_dict(model._image_model.state_dict(), "model")

print("Done with training")
Packages
mlflow-skinny==1.25.1
torch==1.10.2+cu111
torchvision==0.11.3+cu111
Solutions that I tried that did not work
import gc
import json
import torch

# Cleaning up to avoid any open processes...
del trainer
torch.cuda.empty_cache()
# force garbage collection
gc.collect()

parameters = json.dumps({"Status": "SUCCESS", "Message": "DONE"})
dbutils.notebook.exit(parameters)
Upvotes: 2
Views: 1199
Reputation: 1
https://stackoverflow.com/a/72473053/10555941
There is more to add to the answer above. When you set pin_memory=True and set num_workers equal to the total number of vCPUs on the node, the DataLoader relies on inter-process communication (IPC) between its worker processes. That IPC goes through shared memory, and it can exhaust the VM's shared memory, which is what causes the processes to hang.
The num_workers setting only controls how many worker processes load data in parallel, so it does not need to be an extreme value to speed up data loading. Keeping it small, around 30% of the vCPUs, is usually enough.
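For illustration, here is a minimal sketch of that sizing rule that derives num_workers from the node's CPU count instead of hard-coding it; train_dataset is a placeholder for your own Dataset.

import os
from torch.utils.data import DataLoader

# Use roughly 30% of the node's vCPUs for data-loading workers, but never fewer than one.
num_workers = max(1, int((os.cpu_count() or 1) * 0.3))

train_loader = DataLoader(
    train_dataset,            # placeholder: your own Dataset instance
    batch_size=32,
    num_workers=num_workers,
    pin_memory=False,         # avoid pinning to keep shared-memory pressure low
    shuffle=True,
)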
Upvotes: 0
Reputation: 51
I figured out the issue. To solve it, adjust two parameters of the torch.utils.data.DataLoader:
pin_memory
num_workers
Set pin_memory=False and reduce num_workers to roughly 30% of the total vCPUs (e.g. 1 or 2 for a Standard_NC6s_v3). For example:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,     # dataset defined earlier in the notebook
    batch_size=32,
    num_workers=1,     # ~30% of the 6 vCPUs on a Standard_NC6s_v3
    pin_memory=False,
    shuffle=True,
)
The underlying problem appears to be a deadlock in PyTorch's DataLoader rather than in Databricks itself. See the details here.
Upvotes: 3