I am trying to run a multi-node training job using PyTorch's DistributedDataParallel (DDP), following this guide. However, when I launch the job with torchrun (a sketch of my launch command is below the traceback), I encounter the following NCCL error on the worker node(s):
[rank4]: Traceback (most recent call last):
[rank4]: File "/home/user/workspace/ddp/main.py", line 159, in <module>
[rank4]: main()
[rank4]: File "/home/user/workspace/ddp/main.py", line 90, in main
[rank4]: ddp_model = torch.nn.parallel.DistributedDataParallel(model, find_unused_parameters=True, device_ids=[LOCAL_RANK], output_device=LOCAL_RANK)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
[rank4]: _verify_param_shape_across_processes(self.process_group, parameters)
[rank4]: File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/utils.py", line 294, in _verify_param_shape_across_processes
[rank4]: return dist._verify_params_across_processes(process_group, tensors, logger)
[rank4]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]: torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.hpp:268, internal error - please report this issue to the NCCL developers, NCCL version 2.21.5
[rank4]: ncclInternalError: Internal check failed.
[rank4]: Last error:
[rank4]: Bootstrap : no socket interface found
[rank4]:[W131 14:34:49.202068506 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0131 14:34:49.846516 2700574 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2700596 closing signal SIGTERM
W0131 14:34:49.847558 2700574 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 2700598 closing signal SIGTERM
E0131 14:34:49.944460 2700574 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 2700595) of binary: /home/user/workspace/ddp/.venv3.11/bin/python3.11
Traceback (most recent call last):
File "/home/user/workspace/ddp/.venv3.11/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/workspace/ddp/.venv3.11/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2025-01-31_14:34:49
host : *****
rank : 6 (local_rank: 2)
exitcode : 1 (pid: 2700597)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-01-31_14:34:49
host : *****
rank : 4 (local_rank: 0)
exitcode : 1 (pid: 2700595)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
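For context, I launch the job on each node with a command along these lines (the node count, process count, and rendezvous address here are placeholders consistent with the ranks in the traceback, not my exact values):

torchrun --nnodes=2 --nproc_per_node=4 --node_rank=1 \
    --master_addr=10.0.0.1 --master_port=29500 \
    main.py --backend=nccl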
Environment: Python 3.11 in a virtualenv, PyTorch with NCCL 2.21.5 (both visible in the traceback above).

What I have tried:
I tried changing the DDP backend from nccl to gloo in my argument parser:
parser.add_argument("--backend", type=str, default="nccl", choices=["nccl", "gloo", "mpi"], help="DDP backend")
When I set --backend=gloo, the script runs without errors, but training happens on the CPU instead of the GPU. Since I need GPU acceleration I have to use nccl, and that is exactly where the error occurs.
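My process-group setup looks roughly like this simplified sketch (the model is a stand-in and the training loop is omitted; the torch.distributed calls are the ones my script actually uses):

import os
import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--backend", type=str, default="nccl",
                        choices=["nccl", "gloo", "mpi"], help="DDP backend")
    args = parser.parse_args()

    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for each worker
    local_rank = int(os.environ["LOCAL_RANK"])

    # The default env:// init method picks up MASTER_ADDR/MASTER_PORT
    # set by torchrun.
    dist.init_process_group(backend=args.backend)

    if args.backend == "nccl":
        # nccl works on CUDA tensors, so pin this process to its GPU
        torch.cuda.set_device(local_rank)
        device = torch.device("cuda", local_rank)
    else:
        # with gloo I fall back to CPU, which is why the gloo run
        # does not use the GPU
        device = torch.device("cpu")

    model = torch.nn.Linear(10, 10).to(device)  # stand-in for my real model

    # With nccl, this constructor is where my run raises the
    # ncclInternalError shown in the traceback above.
    ddp_model = DistributedDataParallel(
        model,
        find_unused_parameters=True,
        device_ids=[local_rank] if args.backend == "nccl" else None,
        output_device=local_rank if args.backend == "nccl" else None,
    )

    # training loop omitted

    dist.destroy_process_group()

if __name__ == "__main__":
    main()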