macpro1004

Reputation: 51

Databricks notebook hanging with pytorch

We have a Databricks notebook issue. One of our notebook cells seems to hang, even though the driver logs show that the cell has finished executing. Does anyone know why the cell keeps hanging and never completes? See the details below.

Situation

Observations

Code

import mlflow
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

# model, train_loader, val_loader, logger, experiment_id, param1 and param2
# are defined in earlier notebook cells.
trainer = Trainer(
    gpus=-1,
    num_sanity_val_steps=0,
    logger=logger,
    callbacks=[EarlyStopping(monitor="test_loss", patience=2, mode="min", verbose=True)],
)

with mlflow.start_run(experiment_id=experiment_id) as run:
    trainer.fit(model, train_loader, val_loader)
    mlflow.log_param("param1", param1)
    mlflow.log_param("param2", param2)
    mlflow.pytorch.log_model(model._image_model, artifact_path="model", registered_model_name="image_model")
    mlflow.pytorch.log_state_dict(model._image_model.state_dict(), "model")

print("Done with training")

Packages

mlflow-skinny==1.25.1
torch==1.10.2+cu111 
torchvision==0.11.3+cu111

Solutions I tried that did not work

import gc
import json

import torch

# Clean up to avoid any open processes and release GPU memory
del trainer
torch.cuda.empty_cache()
# Force garbage collection
gc.collect()

parameters = json.dumps({"Status": "SUCCESS", "Message": "DONE"})
dbutils.notebook.exit(parameters)  # dbutils is provided by the Databricks runtime

Upvotes: 2

Views: 1199

Answers (2)

Mohammad Subhani

Reputation: 1

https://stackoverflow.com/a/72473053/10555941

There is more to add to the answer linked above. When you set pin_memory=True and set num_workers equal to the total number of vCPUs on the node, the DataLoader workers communicate over IPC. That IPC goes through shared memory, and it can exhaust the VM's shared memory.

This is what leads to the hanging processes. The DataLoader's num_workers only controls how many worker processes are used to load data, so it does not need to be an extreme value to speed up data loading. Something small, around 30% of the vCPUs, is usually enough.
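
For example, here is a minimal sketch of that guideline, deriving num_workers from the driver's CPU count (train_dataset and batch_size=32 are illustrative placeholders, not from the original post):

import os

from torch.utils.data import DataLoader

# Rule of thumb from above: use roughly 30% of the available vCPUs as workers
# instead of one worker per vCPU, and keep pin_memory off.
num_workers = max(1, int((os.cpu_count() or 1) * 0.3))

train_loader = DataLoader(
    train_dataset,  # assumed to be defined elsewhere in the notebook
    batch_size=32,
    num_workers=num_workers,
    pin_memory=False,
    shuffle=True,
)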

Upvotes: 0

macpro1004

Reputation: 51

I figured out the issue. To solve it, adjust the parameters of torch.utils.data.DataLoader:

  1. Disable pin_memory
  2. Set num_workers to about 30% of the total vCPUs (e.g. 1 or 2 for a Standard_NC6s_v3)

For example:

from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,  # defined in an earlier cell
    batch_size=32,
    num_workers=1,
    pin_memory=False,
    shuffle=True,
)

This seems to be a known PyTorch DataLoader deadlock issue. See the details here.

Upvotes: 3
