Axel Sjöberg

Reputation: 11

Progress bar when launching training jobs in SageMaker does not match the number of steps expected for the full training

Background

I am fine-tuning a Mistral-7B-Instruct-v0.1 model on SageMaker, using the same workflow as outlined in these two blog posts:

Everything seemingly works great, and the fine-tuned model produces results that look very good. However, I’m curious about the progress bar.

When I run the fine-tuning on a small dataset containing 100 observations, using the settings shown in the Code section below, this is the progress bar produced during the fine-tuning:

0%|          | 0/4 [00:00<?, ?it/s]
25%|██▌       | 1/4 [00:24<01:14, 24.76s/it]
50%|█████     | 2/4 [00:49<00:49, 24.51s/it]
75%|███████▌  | 3/4 [01:13<00:24, 24.47s/it]
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
{'train_runtime': 97.8848, 'train_samples_per_second': 0.184, 'train_steps_per_second': 0.041, 'train_loss': 1.038140892982483, 'epoch': 1.78}
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
100%|██████████| 4/4 [01:37<00:00, 24.47s/it]

When I instead run the fine-tuning on a dataset that contains 10,000 observations, the progress bar looks like this (showing only the final iterations here):

100%|█████████▉| 491/492 [3:19:46<00:24, 24.41s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
{'train_runtime': 12010.6264, 'train_samples_per_second': 0.164, 'train_steps_per_second': 0.041, 'train_loss': 0.5181044475819038, 'epoch': 2.0}
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.41s/it]

Question

I don’t understand the iteration counts in the progress bar.

With only 100 observations in the fine-tuning set, the number of optimizer steps for two epochs, a per_device_train_batch_size of 1, and gradient_accumulation_steps of 4 should be 200 / 4 = 50.

Analogously, with 10,000 observations, the number of steps should be 20,000 / 4 = 5,000.

Why does the progress bar show 4 and 492 iteration steps here?
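For reference, this is the arithmetic behind my expectation; a minimal sketch (the helper below is only for illustration and assumes every raw observation reaches the trainer as exactly one sample, with no packing or filtering):

# Expected optimizer steps, assuming one training sample per raw observation
# (no packing, no filtering) and a single GPU.
def expected_steps(num_observations,
                   num_train_epochs=2,
                   per_device_train_batch_size=1,
                   gradient_accumulation_steps=4,
                   num_devices=1):
    samples_per_optimizer_step = (per_device_train_batch_size
                                  * gradient_accumulation_steps
                                  * num_devices)
    return (num_observations * num_train_epochs) // samples_per_optimizer_step

print(expected_steps(100))     # 50   (the progress bar shows 4)
print(expected_steps(10_000))  # 5000 (the progress bar shows 492)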

Code

job_name = 'mistralinstruct-7b-hf-mini'

hyperparameters = {
  'dataset_path': '/opt/ml/input/data/training/train_dataset.json',
  'model_id': "mistralai/Mistral-7B-Instruct-v0.1",
  'max_seq_len': 3872,
  'use_qlora': True,
  'num_train_epochs': 2,
  'per_device_train_batch_size': 1,
  'gradient_accumulation_steps': 4,
  'gradient_checkpointing': True,
  'optim': "adamw_torch_fused",
  'logging_steps': 25,
  'save_strategy': "steps",
  'save_steps' : 100,
  'learning_rate': 2e-4,
  'bf16': True,
  'tf32': True, 
  'max_grad_norm': 1.0,
  'warmup_ratio': 0.03,
  'lr_scheduler_type': "constant",
  'report_to': "tensorboard",
  'output_dir': "/opt/ml/checkpoints",
  'merge_adapters': True,
}


sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket='com.ravenpack.dsteam.research.testing'
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
print(sagemaker_session_bucket)

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='SageMaker-ds-research-testing')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)


tensorboard_output_config = TensorBoardOutputConfig(
    container_local_output_path='/opt/ml/output/tensorboard',
    s3_output_path = f's3://{sess.default_bucket()}/...{my_path}...',
)


metric_definitions = [
    {'Name': 'loss', 'Regex': r"'loss':\s*([0-9\.]+)"},
    {'Name': 'grad_norm', 'Regex': r"'grad_norm':\s*([0-9\.]+)"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate':\s*([0-9\.]+)"},
    {'Name': 'epoch', 'Regex': r"'epoch':\s*([0-9\.]+)"}
]
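As a sanity check on how these metric_definitions get applied, the snippet below (not part of the job itself; the sample log line is made up to mirror the dictionary the Trainer prints every logging_steps steps) runs the same regexes locally with Python's re module against the metric_definitions list above:

import re

# Hypothetical log line in the shape the Trainer emits every `logging_steps` steps.
sample_line = "{'loss': 1.0381, 'grad_norm': 0.42, 'learning_rate': 0.0002, 'epoch': 0.89}"

for definition in metric_definitions:
    match = re.search(definition['Regex'], sample_line)
    print(definition['Name'], '->', match.group(1) if match else 'no match')
# loss -> 1.0381, grad_norm -> 0.42, learning_rate -> 0.0002, epoch -> 0.89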


# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_sft.py',    # train script (Philipp Schmid's run_sft.py from https://github.com/philschmid/llm-sagemaker-sample/blob/main/scripts/trl/run_sft.py)
    source_dir           = '...{my_path}...', 
    instance_type        = 'ml.g5.4xlarge',
    instance_count       = 1,             
    max_run              = 1*24*60*60,
    max_wait             = 2*24*60*60,       
    use_spot_instances   = True,
    base_job_name        = job_name,         
    role                 = role,
    volume_size          = 300,
    transformers_version = '4.36',
    pytorch_version      = '2.1',
    py_version           = 'py310',
    hyperparameters      =  hyperparameters,
    disable_output_compression = True,
    environment          = {
                            "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",
                            },
    metric_definitions   = metric_definitions,
    tensorboard_output_config = tensorboard_output_config,
    
    checkpoint_s3_uri = f's3://{sess.default_bucket()}/...{my_path}...',
)   


training_input_path = f's3://{sess.default_bucket()}/...{my_path}...'


data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
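If it helps when comparing the reported step counts, the metrics that SageMaker extracted via the metric_definitions above can be pulled back once the job has run. A minimal sketch, assuming the attribute access on the estimator below resolves to the actual training job name:

from sagemaker.analytics import TrainingJobAnalytics

# Fetch the metrics that SageMaker parsed out of the container logs
# using the metric_definitions regexes defined earlier.
completed_job_name = huggingface_estimator.latest_training_job.name
metrics_df = TrainingJobAnalytics(
    training_job_name=completed_job_name,
    metric_names=['loss', 'epoch'],
).dataframe()
print(metrics_df.head())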

Upvotes: 1

Views: 170

Answers (1)

Tomonori Shimomura

Reputation: 267

Does this issue happen only in the SageMaker notebook environment, or does it also happen in a simpler terminal environment (e.g. an SSH session)?

I found a similar post here: progress bar in jupyter notebook go crazy

Depending on how the character-based progress bar is rendered, a similar issue can occur where multiple lines are rendered unintentionally.
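If the rendering itself is the suspect, one way to take the bar out of the picture is to disable tqdm in the TrainingArguments and rely on the periodic log lines instead. A minimal sketch; whether run_sft.py forwards a disable_tqdm flag is an assumption, so the script may need a small change:

from transformers import TrainingArguments

# disable_tqdm turns off the character-based progress bar entirely,
# leaving only the log dictionaries printed every `logging_steps` steps,
# which makes the true step count easy to read off the CloudWatch logs.
args = TrainingArguments(
    output_dir='/opt/ml/checkpoints',
    logging_steps=25,
    disable_tqdm=True,
)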

Upvotes: 0
