Reputation: 51
We have an issue with Databricks notebooks. One of our notebook cells appears to hang, even though the driver logs show that the cell has finished executing. Does anyone know why the notebook cell keeps hanging and never completes? Details are below.
Situation
Observations
Code
import mlflow
import mlflow.pytorch
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

# logger, model, train_loader, val_loader, experiment_id, param1 and param2 are defined in earlier notebook cells
trainer = Trainer(gpus=-1, num_sanity_val_steps=0, logger=logger, callbacks=[EarlyStopping(monitor="test_loss", patience=2, mode="min", verbose=True)])

with mlflow.start_run(experiment_id=experiment_id) as run:
    trainer.fit(model, train_loader, val_loader)
    mlflow.log_param("param1", param1)
    mlflow.log_param("param2", param2)
    mlflow.pytorch.log_model(model._image_model, artifact_path="model", registered_model_name="image_model")
    mlflow.pytorch.log_state_dict(model._image_model.state_dict(), "model")

print("Done with training")
Packages
mlflow-skinny==1.25.1
torch==1.10.2+cu111
torchvision==0.11.3+cu111
Solutions that I tried that did not work
import gc
import json
import torch

# Cleaning up to avoid any open processes...
del trainer
torch.cuda.empty_cache()
# force garbage collection
gc.collect()

parameters = json.dumps({"Status": "SUCCESS", "Message": "DONE"})
dbutils.notebook.exit(parameters)
Upvotes: 2
Views: 1199
Reputation: 1
https://stackoverflow.com/a/72473053/10555941
There is more to add to the answer above. When you set pin_memory=True and set num_workers equal to the total number of vCPUs on the node, the DataLoader relies on inter-process communication (IPC) between its worker processes. That IPC goes through shared memory, and it can exhaust the VM's shared memory, which is what causes the processes to hang.
The num_workers setting only controls how many worker processes load data in parallel, so it does not need to be an extreme value to speed up data loading. Keeping it small, around 30% of the vCPUs, is usually enough.
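For illustration, here is a minimal sketch of that sizing rule that derives num_workers from the node's CPU count instead of hard-coding it; train_dataset is a placeholder for your own Dataset.

import os
from torch.utils.data import DataLoader

# Use roughly 30% of the node's vCPUs for data-loading workers, but never fewer than one.
num_workers = max(1, int((os.cpu_count() or 1) * 0.3))

train_loader = DataLoader(
    train_dataset,            # placeholder: your own Dataset instance
    batch_size=32,
    num_workers=num_workers,
    pin_memory=False,         # avoid pinning to keep shared-memory pressure low
    shuffle=True,
)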
Upvotes: 0
Reputation: 51
I figured out the issue. To solve it, adjust two parameters of the torch.utils.data.DataLoader:
pin_memory
num_workers
Set pin_memory=False and reduce num_workers to roughly 30% of the total vCPUs (e.g. 1 or 2 for a Standard_NC6s_v3). For example:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,     # dataset defined earlier in the notebook
    batch_size=32,
    num_workers=1,     # ~30% of the 6 vCPUs on a Standard_NC6s_v3
    pin_memory=False,
    shuffle=True,
)
The underlying problem appears to be a deadlock in PyTorch's DataLoader rather than in Databricks itself. See the details here.
Upvotes: 3