Reputation: 11
Background
I am fine-tuning the Mistral-7B-Instruct-v0.1 model, following the same workflow as outlined in these two blog posts (using SageMaker):
Everything seems to work well, and the fine-tuned model produces results that look very good. However, I'm curious about the progress bar.
I ran the fine-tuning on a small dataset of 100 observations, using the settings listed in the Code section below.
This is the progress bar produced during fine-tuning:
0%| | 0/4 [00:00<?, ?it/s]
25%|██▌ | 1/4 [00:24<01:14, 24.76s/it]
50%|█████ | 2/4 [00:49<00:49, 24.51s/it]
75%|███████▌ | 3/4 [01:13<00:24, 24.47s/it]
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
{'train_runtime': 97.8848, 'train_samples_per_second': 0.184, 'train_steps_per_second': 0.041, 'train_loss': 1.038140892982483, 'epoch': 1.78}
100%|██████████| 4/4 [01:37<00:00, 24.42s/it]
100%|██████████| 4/4 [01:37<00:00, 24.47s/it]
When I instead run the fine-tuning on a dataset that contains 10,000 observations, the progress bar looks like this (showing only the final iterations):
100%|█████████▉| 491/492 [3:19:46<00:24, 24.41s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
{'train_runtime': 12010.6264, 'train_samples_per_second': 0.164, 'train_steps_per_second': 0.041, 'train_loss': 0.5181044475819038, 'epoch': 2.0}
100%|██████████| 492/492 [3:20:10<00:00, 24.40s/it]
100%|██████████| 492/492 [3:20:10<00:00, 24.41s/it]
Question
I don't understand the iteration counts in the progress bar.
With only 100 observations, two epochs, a batch size of 1, and gradient_accumulation_steps of 4, the number of steps should be 200 / 4 = 50.
Analogously, with 10,000 observations, the number of steps should be 20,000 / 4 = 5,000.
Why does the progress bar show 4 and 492 steps instead?
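To make the mismatch concrete, here is a rough sketch of the arithmetic. The function names are my own (hypothetical), and the step formula is a simplification of what the Hugging Face Trainer does, since the exact rounding can differ between library versions. The second function works backwards from the logs and assumes that the reported 'epoch' value is the number of passes over the training data completed so far.

```python
def expected_optimizer_steps(num_samples, num_epochs, batch_size, grad_accum_steps):
    """Steps I expected: one optimizer step per batch_size * grad_accum_steps samples."""
    batches_per_epoch = num_samples // batch_size
    # Simplified: floor division, but at least one step per epoch.
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


def implied_samples_per_epoch(total_steps, reported_epoch, batch_size, grad_accum_steps):
    """Samples per epoch implied by the logs (assumes 'epoch' counts passes over the data)."""
    samples_processed = total_steps * batch_size * grad_accum_steps
    return samples_processed / reported_epoch


# What I expected from the raw observation counts.
print(expected_optimizer_steps(100, 2, 1, 4))
print(expected_optimizer_steps(10_000, 2, 1, 4))

# What the progress bars and the final log lines imply instead.
print(round(implied_samples_per_epoch(4, 1.78, 1, 4)))
print(round(implied_samples_per_epoch(492, 2.0, 1, 4)))
```

Under these assumptions the trainer appears to see roughly 9 samples per epoch in the small run and 984 in the large run, not 100 and 10,000. The reported train_samples_per_second multiplied by train_runtime gives about 18 and about 1,970 samples in total, which points the same way. This would be consistent with the training script reducing the number of training sequences before training starts (for example by packing several observations into one sequence of up to max_seq_len tokens), but that is only a guess on my part and I have not verified it.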
Code
import boto3
import sagemaker
from sagemaker.debugger import TensorBoardOutputConfig
from sagemaker.huggingface import HuggingFace

job_name = 'mistralinstruct-7b-hf-mini'
# hyperparameters passed to run_sft.py
hyperparameters = {
    'dataset_path': '/opt/ml/input/data/training/train_dataset.json',
    'model_id': "mistralai/Mistral-7B-Instruct-v0.1",
    'max_seq_len': 3872,
    'use_qlora': True,
    'num_train_epochs': 2,
    'per_device_train_batch_size': 1,
    'gradient_accumulation_steps': 4,
    'gradient_checkpointing': True,
    'optim': "adamw_torch_fused",
    'logging_steps': 25,
    'save_strategy': "steps",
    'save_steps': 100,
    'learning_rate': 2e-4,
    'bf16': True,
    'tf32': True,
    'max_grad_norm': 1.0,
    'warmup_ratio': 0.03,
    'lr_scheduler_type': "constant",
    'report_to': "tensorboard",
    'output_dir': "/opt/ml/checkpoints",
    'merge_adapters': True,
}
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = 'com.ravenpack.dsteam.research.testing'
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
print(sagemaker_session_bucket)
try:
    role = sagemaker.get_execution_role()
except ValueError:
    # fall back to looking up the role ARN when not running inside SageMaker
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='SageMaker-ds-research-testing')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
tensorboard_output_config = TensorBoardOutputConfig(
    container_local_output_path='/opt/ml/output/tensorboard',
    s3_output_path=f's3://{sess.default_bucket()}/...{my_path}...',
)
# raw strings so that \s and \. reach the regex engine unchanged
metric_definitions = [
    {'Name': 'loss', 'Regex': r"'loss':\s*([0-9\.]+)"},
    {'Name': 'grad_norm', 'Regex': r"'grad_norm':\s*([0-9\.]+)"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate':\s*([0-9\.]+)"},
    {'Name': 'epoch', 'Regex': r"'epoch':\s*([0-9\.]+)"},
]
# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point='run_sft.py',  # train script (used Philip's from https://github.com/philschmid/llm-sagemaker-sample/blob/main/scripts/trl/run_sft.py)
    source_dir='...{my_path}...',
    instance_type='ml.g5.4xlarge',
    instance_count=1,
    max_run=1*24*60*60,
    max_wait=2*24*60*60,
    use_spot_instances=True,
    base_job_name=job_name,
    role=role,
    volume_size=300,
    transformers_version='4.36',
    pytorch_version='2.1',
    py_version='py310',
    hyperparameters=hyperparameters,
    disable_output_compression=True,
    environment={
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",
    },
    metric_definitions=metric_definitions,
    tensorboard_output_config=tensorboard_output_config,
    checkpoint_s3_uri=f's3://{sess.default_bucket()}/...{my_path}...',
)
training_input_path = f's3://{sess.default_bucket()}/...{my_path}...'
data = {'training': training_input_path}
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
Upvotes: 1
Views: 170
Reputation: 267
Does this issue happen only in the SageMaker notebook environment, or does it also happen in a simpler terminal environment (e.g. an SSH session)?
I found a similar post here: progress bar in jupyter notebook go crazy
Depending on how the character-based progress bar is rendered, a similar issue can occur where multiple lines are rendered unintentionally.
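To illustrate the rendering point, here is a minimal sketch using only the standard library. It is a stand-in for a tqdm-style bar, not tqdm itself, and render_progress is a hypothetical helper. The bar redraws itself by writing a carriage return before each update. A terminal overwrites the same line, but anything that captures the raw stream (a log file, a notebook cell, a log viewer) keeps every update and may show each one on its own line.

```python
import io


def render_progress(total, stream):
    """Write a simple progress indicator that redraws itself with a carriage return."""
    for step in range(total + 1):
        # "\r" moves the cursor to the start of the line; it does not erase anything.
        stream.write(f"\r{step}/{total}")


# Capture the raw output the way a log collector would.
captured = io.StringIO()
render_progress(4, captured)

# Every redraw is still present in the captured text.
frames = captured.getvalue().split("\r")[1:]
print(frames)
```

This explains repeated or duplicated lines such as the final 100% line appearing several times. It does not by itself change the total step count shown in the bar.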
Upvotes: 0