Reputation: 301
I am not able to initialize the process group in PyTorch for a BERT model. I tried to initialize it with the following code:
import torch
import datetime

torch.distributed.init_process_group(
    backend='nccl',
    init_method='env://',
    timeout=datetime.timedelta(0, 1800),
    world_size=0,
    rank=0,
    store=None,
    group_name=''
)
and then tried to call get_world_size():
num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
Full code:
train_examples = None
num_train_optimization_steps = None
if do_train:
    train_examples = processor.get_train_examples(data_dir)
    num_train_optimization_steps = int(
        len(train_examples) / train_batch_size / gradient_accumulation_steps) * num_train_epochs
    if local_rank != -1:
        import datetime
        torch.distributed.init_process_group(backend='nccl', init_method='env://', timeout=datetime.timedelta(0, 1800), world_size=0, rank=0, store=None, group_name='')
        num_train_optimization_steps = num_train_optimization_steps // torch.distributed.get_world_size()
        print(num_train_optimization_steps)
Upvotes: 14
Views: 47187
Reputation: 81
How to set up distributed training with PyTorch is described in this guide: https://huggingface.co/blog/pytorch-ddp-accelerate-transformers
But you can also do the setup yourself by adding the following lines to your code:
import os
import torch.distributed as dist

os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
# rank must be smaller than world_size; for a single process use rank=0, world_size=1
dist.init_process_group(backend='nccl', init_method='env://', rank=0, world_size=1)
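If you want one process per GPU instead of a single process, a common pattern is to spawn the workers yourself and give each one its own rank. A minimal sketch of that pattern (the worker function and its arguments are illustrative, not part of the original answer, and it assumes the machine actually has the GPUs it counts):
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # each spawned process joins the same group with its own rank
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    dist.init_process_group(backend='nccl', init_method='env://',
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    print(f"rank {rank}/{dist.get_world_size()} initialized")
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = torch.cuda.device_count()
    mp.spawn(worker, args=(world_size,), nprocs=world_size)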
Upvotes: 4
Reputation: 146
You can also add these lines to your script if you want to run it with plain python (helpful for debugging purposes):
import os
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
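For example, a single-process debug run could look like the sketch below; it is an assumption on top of this answer (the gloo backend is picked here only so it also works without a GPU), not the original poster's code:
import os
import torch.distributed as dist

# pretend to be the only process in the job so `python my_script.py` works
os.environ['MASTER_ADDR'] = 'localhost'
os.environ['MASTER_PORT'] = '12355'
os.environ.setdefault('RANK', '0')
os.environ.setdefault('WORLD_SIZE', '1')

dist.init_process_group(backend='gloo', init_method='env://')
print(dist.get_rank(), dist.get_world_size())
dist.destroy_process_group()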
Upvotes: 0
Reputation: 101
Just an update: instead of running
$ python -m torch.distributed.launch --use_env train_script.py
You now only need to run:
$ torchrun train_script.py
As indicated here.
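torchrun exports RANK, WORLD_SIZE and LOCAL_RANK as environment variables for every process it starts, so the script itself only needs to read them. A rough sketch of what train_script.py could contain (this is an assumed minimal script using the nccl backend on GPUs, not the asker's actual training code):
import os
import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT
dist.init_process_group(backend='nccl', init_method='env://')
local_rank = int(os.environ['LOCAL_RANK'])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} on GPU {local_rank}")
dist.destroy_process_group()
You would then launch it with, for example, torchrun --nproc_per_node=4 train_script.py.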
Upvotes: 10
Reputation: 151
I solved the problem by referring to https://github.com/NVIDIA/apex/issues/99. Specifically, run:
python -m torch.distributed.launch xxx.py
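torch.distributed.launch passes a --local_rank argument to the script it starts (unless --use_env is given), so xxx.py has to accept it. A hedged sketch of what that looks like (not the exact BERT script, and the step count is just an example value):
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1)
args = parser.parse_args()

if args.local_rank != -1:
    # the launcher also sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl', init_method='env://')
    num_train_optimization_steps = 1000 // dist.get_world_size()  # example value
    print(num_train_optimization_steps)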
Upvotes: 15