Siladittya

Reputation: 1205

RuntimeError: no support for _allgather_base in Gloo process group

I am trying to run DINOv2 on Windows. So far, I have been able to run it by using the "Gloo" backend for torch.distributed, and I haven't faced any issues except the error mentioned in the title.
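For context, this is roughly how I set up the process group (a minimal single-process sketch; the rendezvous values below are placeholders, not my exact configuration):

    import os
    import torch.distributed as dist

    # Placeholder rendezvous settings for a single-node run
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # Gloo is used because the NCCL backend is not available on Windows
    dist.init_process_group(backend="gloo", rank=0, world_size=1)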

I followed the change suggested in https://github.com/pytorch/pytorch/issues/74041, wherein I changed the

    if group is None:
        default_pg = _get_default_group()
        work = default_pg._allgather_base(output_tensor, input_tensor)
    else:
        work = group._allgather_base(output_tensor, input_tensor)

in the all_gather_into_tensor function definition in torch/distributed/distributed_c10d.py (line 2528)

to

    if group is None:
        default_pg = _get_default_group()
        work = default_pg._allgather_base(output_tensor, input_tensor)
    else:
        if group._get_backend_name() == Backend.GLOO:
            return all_gather([output_tensor], input_tensor, group=group)
        else:
            work = group._allgather_base(output_tensor, input_tensor)
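For reference, the same fallback could also be applied as a runtime monkey-patch, so that the installed distributed_c10d.py does not have to be edited. This is only a sketch of the idea; in particular, splitting the flat output tensor into per-rank views with torch.chunk is my assumption about how the output maps onto ranks:

    import torch
    import torch.distributed as dist

    _orig_all_gather_into_tensor = dist.all_gather_into_tensor

    def _gloo_safe_all_gather_into_tensor(output_tensor, input_tensor, group=None, async_op=False):
        # Gloo lacks _allgather_base, so fall back to all_gather there
        if dist.get_backend(group) == "gloo":
            world_size = dist.get_world_size(group)
            # Per-rank views into the flat output tensor (assumption: the
            # output is the concatenation of per-rank inputs along dim 0)
            chunks = list(torch.chunk(output_tensor, world_size, dim=0))
            return dist.all_gather(chunks, input_tensor, group=group, async_op=async_op)
        return _orig_all_gather_into_tensor(output_tensor, input_tensor, group=group, async_op=async_op)

    dist.all_gather_into_tensor = _gloo_safe_all_gather_into_tensor

Note that any code importing all_gather_into_tensor directly, rather than calling it through torch.distributed, would not pick up this patch.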

But this change runs into another problem, which is also mentioned in the same issue: https://github.com/pytorch/pytorch/issues/74041#issuecomment-1637434611

As per the comments, after the all_gather call the parameters are returned as a list of tensors, while _save_to_state_dict calls a function that expects them as one large tensor. That function appears to be _local_pre_state_dict_hook in torch.distributed.fsdp._state_dict_utils.
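A toy illustration of that mismatch (the shapes here are hypothetical):

    import torch

    world_size = 4
    # What the all_gather fallback produces: a list of per-rank tensors
    shards = [torch.randn(8) for _ in range(world_size)]
    # What the state-dict path expects: one flat tensor, as all_gather_into_tensor would produce
    flat = torch.cat(shards)
    assert flat.shape == (world_size * 8,)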

The change suggested in one comment (https://github.com/pytorch/pytorch/issues/74041#issuecomment-1637508197) is as follows:

    def _local_pre_state_dict_hook(
        fsdp_state: _FSDPState,
        module: nn.Module,
        *args,
        **kwargs,
    ) -> None:
        """
        Hook that runs before model.state_dict() is called. Right now, pre-state_dict
        hook is not supported by the PyTorch core. So this API is called from
        `_local_post_state_dict_hook()` to simulate the case.
        """
        #if (
        #    _has_fsdp_params(fsdp_state, module)
        #    and not _module_handles(fsdp_state, module)[0].uses_sharded_strategy
        #):
        #    raise RuntimeError(
        #        "``local_state_dict`` can only be used when parameters are flatten "
        #        "and sharded."
        #    )
        _common_pre_state_dict_hook(module, fsdp_state)
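As far as I can tell, this just comments out the guard that rejects non-sharded parameters in the local state-dict path; it does not address the underlying shape mismatch.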

But is there any other way to do this, that is, to avoid the Gloo-backend error in the all_gather_into_tensor function and also the error when saving the model, without the changes described above?

The error can be reproduced using the code in this repo, where I have modified the DINOv2 code to run on Windows: https://github.com/sadimanna/dinov2_nosetup_windows

But I suspect the issue is tied to the Gloo backend rather than to Windows itself.

Upvotes: 0

Views: 65

Answers (0)
