Reputation: 1205
I am trying to run DINOv2 on Windows. So far, I have been able to run it by using the "Gloo" backend for torch.distributed,
and I haven't faced any issues except the error mentioned in the title.
I followed the change suggested in https://github.com/pytorch/pytorch/issues/74041, wherein I changed
    if group is None:
        default_pg = _get_default_group()
        work = default_pg._allgather_base(output_tensor, input_tensor)
    else:
        work = group._allgather_base(output_tensor, input_tensor)
in the all_gather_into_tensor function definition in torch/distributed/distributed_c10d.py (line 2528)
to
    if group is None:
        default_pg = _get_default_group()
        work = default_pg._allgather_base(output_tensor, input_tensor)
    else:
        if group._get_backend_name() == Backend.GLOO:
            return all_gather([output_tensor], input_tensor, group=group)
        else:
            work = group._allgather_base(output_tensor, input_tensor)
But this change runs into another problem, also reported in the same issue: https://github.com/pytorch/pytorch/issues/74041#issuecomment-1637434611
As per the comments, it seems that after the all_gather call the parameters are returned as a list, while _save_to_state_dict calls a function that expects them as one large tensor. This function appears to be _local_pre_state_dict_hook in torch.distributed.fsdp._state_dict_utils.
The changes suggested in one comment (https://github.com/pytorch/pytorch/issues/74041#issuecomment-1637508197) are as follows:
    def _local_pre_state_dict_hook(
        fsdp_state: _FSDPState,
        module: nn.Module,
        *args,
        **kwargs,
    ) -> None:
        """
        Hook that runs before model.state_dict() is called. Right now, pre-state_dict
        hook is not supported by the PyTorch core. So this API is called from
        `_local_post_state_dict_hook()` to simulate the case.
        """
        # if (
        #     _has_fsdp_params(fsdp_state, module)
        #     and not _module_handles(fsdp_state, module)[0].uses_sharded_strategy
        # ):
        #     raise RuntimeError(
        #         "``local_state_dict`` can only be used when parameters are flatten "
        #         "and sharded."
        #     )
        _common_pre_state_dict_hook(module, fsdp_state)
But is there any other way to do this, that is, a way to avoid the Gloo backend error in the all_gather_into_tensor
function and also avoid the error when saving the model, without the changes described above?
The error can be replicated using the code in this repo, where I have modified the DINOv2 code to run on Windows: https://github.com/sadimanna/dinov2_nosetup_windows
But I suspect the issue is associated with the Gloo backend rather than with Windows itself.
Upvotes: 0
Views: 65