Stelios M

Reputation: 160

Enhance a MarianMT pretrained model from HuggingFace with more training data

I am using a pretrained MarianMT machine translation model from English to German, which was trained on the OPUS corpus. I also have a large set of high-quality English-to-German sentence pairs that I would like to use to improve the model's performance, but without making the model forget its OPUS training data. Is there a way to do that? Thanks.

Upvotes: 1

Views: 1675

Answers (2)

Felix Mueller

Reputation: 362

I did the finetuning as described here: https://github.com/huggingface/transformers/tree/master/examples/seq2seq#translation

To train the model (de to fr) and run evaluation at the end:

python examples/seq2seq/run_translation.py \
--do_train True \
--do_eval True \
--model_name_or_path Helsinki-NLP/opus-mt-de-fr \
--source_lang de \
--target_lang fr \
--source_prefix "translate German to French: " \
--train_file ../data/translations-train-de-fr1.json \
--validation_file ../data/translations-val-de-fr1.json \
--output_dir ../tst-translation-models \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir True \
--predict_with_generate True \
--fp16 True

The trained model gets stored into the folder: tst-translation-models
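If you just want to translate with that fine-tuned checkpoint from Python rather than re-running the script, here is a minimal sketch. It assumes the ../tst-translation-models output folder from the command above and reuses the German sentence from the data example below; those names are taken from this answer, not from the question:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned checkpoint saved by run_translation.py
model_dir = "../tst-translation-models"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(model_dir)

# Translate one German sentence to French
inputs = tokenizer(["Freilegung der Leitung (durch VN installiert)"],
                   return_tensors="pt", padding=True)
outputs = model.generate(**inputs)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))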

Using the fine-tuned model to do evaluation only (for the model folder, use 'copy_mode.sh', which must be adapted per language):

python examples/seq2seq/run_translation.py \
--do_train False \
--do_eval True \
--model_name_or_path ../tst-translation-models \
--source_lang de \
--target_lang fr \
--source_prefix "translate German to French: " \
--validation_file ../data/translations-val-de-fr1.json \
--per_device_eval_batch_size=4 \
--predict_with_generate True \
--fp16 True

Regards, Felix

PS: Note that the training and validation data must be in the following form (one such entry per row): { "translation": { "de": "Freilegung der Leitung (durch VN installiert)", "fr": "Dégagement de la conduite (installée par le PA)" } }
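For completeness, a small sketch of writing such a JSON-lines file from two aligned sentence lists; the sentence lists and the output file name are hypothetical placeholders matching the example row above:

import json

# Hypothetical aligned sentence lists; each de/fr pair becomes one row
de_sentences = ["Freilegung der Leitung (durch VN installiert)"]
fr_sentences = ["Dégagement de la conduite (installée par le PA)"]

with open("translations-train-de-fr1.json", "w", encoding="utf-8") as f:
    for de, fr in zip(de_sentences, fr_sentences):
        f.write(json.dumps({"translation": {"de": de, "fr": fr}}, ensure_ascii=False) + "\n")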

Upvotes: 1

chilukrn

Reputation: 46

Have you tried the finetune.sh script shown here? In addition to the short list of CLI flags listed there, you could try adding:

--src_lang "en" \
--tgt_lang "de" \
--num_train_epochs 400 \
--warmup_steps 20 \
--train_batch_size 32 \
--eval_batch_size 32 \
--data_dir "/data/dir" \
--output_dir "/path/to/store/model/etc" \
--cache_dir "/path/for/misc/files" \
--max_source_length 128 \
--max_target_length 128 \
--val_max_target_length 128 \
--test_max_target_length 128 \
--model_name_or_path "</path/to/pretrained>"

where the "/path/to/pretrained" can be either a local path on your machine or a MarianMT model name (e.g. Helsinki-NLP/opus-mt-en-de or equivalent). The "data/dir" contains a "train.source" and a "train.target" file for the source and target languages, such that line number x of the target is the translation of line x of the source (and likewise for "val.source" and "val.target"; see the sketch below). I changed the finetune.py script here to

parser = TranslationModule.add_model_specific_args(parser, os.getcwd())

and then ran the finetune.sh script.
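Regarding the data_dir layout mentioned above, here is a minimal sketch of producing the line-aligned files; the sentence lists are hypothetical, and "/data/dir" is just the placeholder path passed via --data_dir in the flags above:

import os

# Hypothetical aligned English/German sentence lists
en_sentences = ["Hello world.", "How are you?"]
de_sentences = ["Hallo Welt.", "Wie geht es dir?"]

data_dir = "/data/dir"  # same placeholder directory as in --data_dir
os.makedirs(data_dir, exist_ok=True)

# Line x of train.target must be the translation of line x of train.source
with open(os.path.join(data_dir, "train.source"), "w", encoding="utf-8") as src, \
     open(os.path.join(data_dir, "train.target"), "w", encoding="utf-8") as tgt:
    for en, de in zip(en_sentences, de_sentences):
        src.write(en + "\n")
        tgt.write(de + "\n")

# val.source / val.target follow the same layout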

Note: The gradients blew up when I used the "fp16" flag (with PyTorch 1.6), so I removed it. Also, you might want to look at the "val_check_interval" and "check_val_every_n_epoch" flags, and probably check this issue on how to save multiple checkpoints.

Upvotes: 3
