Charlie Parker

Reputation: 5111

How does one use Accelerate with the Hugging Face (HF) Trainer?

What code changes does one have to make to run Accelerate with a Trainer? I keep seeing:

from accelerate import Accelerator

accelerator = Accelerator()

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss)
    optimizer.step()
    scheduler.step()

but when I tried the analogous thing it didn't work:

!pip install accelerate
!pip install datasets
!pip install transformers

# %%
from accelerate import Accelerator
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, TrainingArguments, Trainer

# Initialize accelerator
accelerator = Accelerator()

# Specify dataset
dataset = load_dataset('imdb')

# Specify tokenizer and model
tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(accelerator.device)


# Tokenize and format dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512)


tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=accelerator.num_processes,
    remove_columns=["text"]
)

# Training configuration
training_args = TrainingArguments(
    output_dir="output",
    overwrite_output_dir=True,
    # num_train_epochs=3,
    max_steps=10,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    fp16=False,  # Set to True for mixed precision training (FP16)
    fp16_full_eval=False,  # Set to True for mixed precision evaluation (FP16)
    dataloader_num_workers=accelerator.num_processes,  # Use multiple processes for data loading
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    tokenizer=tokenizer,
)

# Train model
trainer.train()

why?


Upvotes: 5

Views: 12350

Answers (3)

Proton Boss

Reputation: 145

Load your model like this with PartialState():


import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    device_map={"": PartialState().process_index},  # place the model on this process's device
)

There is no need to call accelerator.prepare() on your Trainer object.

Full pseudocode:


tokenizer = ...

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    device_map={"": PartialState().process_index},
)

trainer_arguments = ...

trainer = Trainer(...)

trainer.train()

Now launch your job like this:

accelerate launch --config_file {path/to/config/my_config_file.yaml} {script_name.py} {--arg1} {--arg2} ...

Note: While running accelerate config, make sure to give the GPU IDs as a comma-separated list (do not write all).
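
For example, the relevant line in the generated config might then look like this (the IDs shown are illustrative, not from the original post):

gpu_ids: 0,1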

Upvotes: 0

arkoi

Reputation: 116

For me, after several iterations of rewriting the complete training loop to use Accelerate, I realized that I do not need to make any changes to my code that uses Trainer. I just need to wrap the Trainer inside the accelerator:

Of course, I need to import Accelerate first:

    from accelerate import Accelerator

For example, this Trainer part:

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_eval_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
    )
    trainer.train()

should be:

    accelerator = Accelerator()
    trainer = accelerator.prepare(Trainer(
        model=model,
        args=args,
        train_dataset=tokenized_train_dataset,
        eval_dataset=tokenized_eval_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
    ))
    trainer.train()

And that is it. Now you need to call it from the command line with the accelerate launch command. Before running accelerate launch, you need a config file for Accelerate. To create one, write in the command line:

accelerate config

and it will ask you for your configuration, question by question. I have 2 GPUs inside one machine, and my answers look like this:

In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0

Which type of machine are you using? ([0] No distributed training, [1] multi-CPU, [2] multi-GPU, [3] TPU): 2

How many different machines will you use (use more than 1 for multi-node training)? [1]: 1

Do you want to use DeepSpeed? [yes/NO]: no

Do you want to use FullyShardedDataParallel? [yes/NO]: no

How many GPU(s) should be used for distributed training? [1]: 2

Do you wish to use FP16 or BF16 (mixed precision)? [NO/fp16/bf16]: no
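
With those answers, accelerate config writes a default_config.yaml. A rough sketch of what it might contain for this 2-GPU setup (the exact keys vary with the accelerate version) is:

    compute_environment: LOCAL_MACHINE
    distributed_type: MULTI_GPU
    machine_rank: 0
    main_training_function: main
    mixed_precision: 'no'
    num_machines: 1
    num_processes: 2
    use_cpu: false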

And then you need to call the Python code using accelerate launch:

accelerate launch --config_file {path/default_config.yaml} main.py
    

Additionally, in my case I wanted to run several scripts with Accelerate at the same time. The only change needed is in the accelerate launch command: I had to specify a different port for each script. For example, my complete command for script 1 was:

accelerate launch --config_file C:\Users\user/.cache\huggingface\accelerate\default_config.yaml --main_process_port 29501 ./python_code1.py

for script 2:

accelerate launch --config_file C:\Users\user/.cache\huggingface\accelerate\default_config.yaml --main_process_port 29502 ./python_code2.py

For those using Windows, where we do not have NCCL support, you need to use gloo. First, import the packages:

    import torch.distributed as dist
    import os

then set the backend to gloo:

    dist.init_process_group(backend='gloo')  # use gloo instead of NCCL, which is unavailable on Windows

That worked for me; hopefully someone will find it useful.
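
Depending on your transformers version, you may also be able to select the backend through TrainingArguments instead of calling torch.distributed yourself; this is an assumption about the installed version, so check whether your TrainingArguments exposes a ddp_backend argument:

    from transformers import TrainingArguments

    # Assumption: ddp_backend is exposed by this transformers version.
    args = TrainingArguments(output_dir="output", ddp_backend="gloo")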

Upvotes: 6

Charlie Parker

Reputation: 5111

Since the Trainer already creates an Accelerator object inside its own code, you don't have to make any code changes except for writing your own accelerate config and launching the script as:

accelerate launch --config_file {path/to/config/my_config_file.yaml} {script_name.py} {--arg1} {--arg2} ...

An example config is given at the end.
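
For concreteness, here is a rough sketch of a script that needs no Accelerate-specific code at all and still trains on multiple GPUs when started with accelerate launch (the model, dataset, and hyperparameters below are placeholders, not taken from the original post):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# No Accelerator() and no accelerator.prepare(): the Trainer wires up Accelerate itself.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

dataset = load_dataset("imdb")
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    batched=True,
)

args = TrainingArguments(output_dir="output", max_steps=10, per_device_train_batch_size=1)
trainer = Trainer(model=model, args=args, train_dataset=tokenized["train"], tokenizer=tokenizer)
trainer.train()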

Also, some example commands:

accelerate launch -m pdb --config_file {path/to/config/my_config_file.yaml} {path/to/script_name.py} {--arg1} {--arg2} ...

accelerate launch -m pdb --config_file ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/sweep_configs_falcon7b_fft/falcon_accelerate_hyperturing.yaml ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/main_falcon_uu.py --report_to none --path2config ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/sweep_configs_falcon7b_fft/falcon_debug_config.yaml
accelerate launch -m pdb --config_file ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/sweep_configs_falcon7b_fft/falcon_accelerate_hyperturing.yaml ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/main_falcon_uu.py --report_to wandb --path2config ~/ultimate-utils/ultimate-utils-proj-src/uutils/hf_uu/mains_hf/falcon_uu/sweep_configs_falcon7b_fft/falcon_sweep_config.yaml


Long answer

My assumption was that there would be code changes, since every other Accelerate tutorial shows changes like this, e.g.:

+ from accelerate import Accelerator
  from transformers import AdamW, AutoModelForSequenceClassification, get_scheduler

+ accelerator = Accelerator()

  model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
  optimizer = AdamW(model.parameters(), lr=3e-5)

- device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
- model.to(device)

+ train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
+     train_dataloader, eval_dataloader, model, optimizer
+ )

  num_epochs = 3
  num_training_steps = num_epochs * len(train_dataloader)
  lr_scheduler = get_scheduler(
      "linear",
      optimizer=optimizer,
      num_warmup_steps=0,
      num_training_steps=num_training_steps
  )

  progress_bar = tqdm(range(num_training_steps))

  model.train()
  for epoch in range(num_epochs):
      for batch in train_dataloader:
-         batch = {k: v.to(device) for k, v in batch.items()}
          outputs = model(**batch)
          loss = outputs.loss
-         loss.backward()
+         accelerator.backward(loss)

          optimizer.step()
          lr_scheduler.step()
          optimizer.zero_grad()
          progress_bar.update(1)

but those code changes are already inside the Trainer. Their integration is so seamless that it's easy to miss, or perhaps it's just not in the tutorials, so one has to look at the Trainer code, e.g.:

if is_accelerate_available():
    from accelerate import __version__ as accelerate_version

    if version.parse(accelerate_version) >= version.parse("0.16"):
        from accelerate import skip_first_batches

    from accelerate import Accelerator
    from accelerate.utils import ...

So just make an accelerate config and run it, e.g.:

# -----> see this ref: https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config
# ref for fsdp to know how to change fsdp opts: https://huggingface.co/docs/accelerate/usage_guides/fsdp
# ref for accelerate to know how to change accelerate opts: https://huggingface.co/docs/accelerate/basic_tutorials/launch
# ref alpaca accelerate config: https://github.com/tatsu-lab/alpaca_farm/tree/main/examples/accelerate_configs

main_training_function: main  # <- change

deepspeed_config: { }
distributed_type: FSDP
downcast_bf16: 'no'
dynamo_backend: 'NO'
# seems alpaca was based on: https://huggingface.co/docs/accelerate/usage_guides/fsdp
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  #  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer  # <-change
  fsdp_transformer_layer_cls_to_wrap: FalconDecoderLayer  # <-change
#  fsdp_min_num_params:  7e9 # e.g., suggested heuristic: num_params / num_gpus = params/gpu, multiply by precision in bytes to know GBs used
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
megatron_lm_config: { }
#mixed_precision: 'bf16'
#mixed_precision: 'no'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

Upvotes: 5
