maria labed

Reputation: 11

Fine-tuning Whisper for a translation task: "speech in a custom dialect to translated text in another custom language"

I recently fine-tuned the Whisper-Tiny model on my custom speech dataset for transcription tasks, and it worked well. However, when I tried to fine-tune the model for a translation task using the same dataset, I ran into some trouble. To start, I only used a small portion of my dataset for the initial fine-tuning to verify that the pipeline was set up correctly before tweaking the training parameters.

The model training completed without any errors, but during inference, the output is always in English, which is not what I need. My goal is to have the output in Arabic script.

I’m seeking advice on what might have gone wrong during the fine-tuning process. Any suggestions or insights would be greatly appreciated.


from datasets import Audio, Dataset, DatasetDict
import torchaudio

common_voice = DatasetDict()

# Toy subset of my data: each wav file is paired with its Darija transcription
# ("sentence") and its Arabic translation ("translation"); entries are repeated
# just to pad out this small pipeline test.
fileList = ["wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav","wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav","wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav","wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav",]
sentenceList = ["واش كملتي؟", "مسا الخير" , "جفنه","واش كملتي؟", "مسا الخير" , "جفنه","واش كملتي؟", "مسا الخير" , "جفنه","واش كملتي؟", "مسا الخير" , "جفنه"]
sentenceTranslationList = ["هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس","هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس","هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس","هل انتهيت؟", "مسا الخير" , "دلو غسيل الملابس"]

def speech_file_to_array_fn(path):
    # load a wav file and return its waveform plus native sampling rate
    speech_array, sampling_rate = torchaudio.load(path)
    return {"array": speech_array[0].numpy(), "sampling_rate": sampling_rate}

common_voice["train"] = Dataset.from_dict({"path": fileList, "sentence": sentenceList, "translation":sentenceTranslationList, 'audio':[ speech_file_to_array_fn(f)  for f in fileList]})
common_voice["test"]= Dataset.from_dict({"path": fileList, "sentence": sentenceList, "translation":sentenceTranslationList, 'audio':[ speech_file_to_array_fn(f)  for f in fileList]})
 

"""## Prepare Feature Extractor, Tokenizer and Data

We'll go through the details of setting up the feature extractor and tokenizer one by one!

### Load WhisperFeatureExtractor
 

We'll load the feature extractor from the pre-trained checkpoint with the default values:
"""

from transformers import WhisperFeatureExtractor

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")

"""### Load WhisperTokenizer 
"""

from transformers import WhisperTokenizer

tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")
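"""(Optional sanity check, my addition rather than part of the original notebook:
if the tokenizer is really configured for Arabic + translate, encoding a sentence
and decoding it with special tokens kept should show the `<|ar|>` and
`<|translate|>` prefix tokens.)"""

sample_ids = tokenizer("هل انتهيت؟").input_ids
print(tokenizer.decode(sample_ids, skip_special_tokens=False))  # expect a <|startoftranscript|><|ar|><|translate|>... prefix
print(tokenizer.decode(sample_ids, skip_special_tokens=True))   # should give back the original sentence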

"""### Combine To Create A WhisperProcessor

To simplify using the feature extractor and tokenizer, we can _wrap_
both into a single `WhisperProcessor` class. This processor object
inherits from the `WhisperFeatureExtractor` and `WhisperTokenizer`,
and can be used on the audio inputs and model predictions as required.
In doing so, we only need to keep track of two objects during training:
the `processor` and the `model`:
"""

from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")

"""### Prepare Data

Let's print the first example of our dataset (held in the `common_voice`
`DatasetDict`) to see what form the data is in:
"""


from datasets import Audio

common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

"""Re-loading the first audio sample in the Common Voice dataset will resample
it to the desired sampling rate:
"""

print(common_voice["train"][0])

"""Now we can write a function to prepare our data ready for the model:
1. We load and resample the audio data by calling `batch["audio"]`. As explained above, 🤗 Datasets performs any necessary resampling operations on the fly.
2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.
3. We encode the transcriptions to label ids through the use of the tokenizer.
"""

def prepare_dataset(batch):
    # load the audio data (resampled on the fly to 16 kHz by the Audio feature)
    audio = batch["audio"]
    # compute log-Mel input features from input audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode target text to label ids
    batch["labels"] = tokenizer(batch["translation"]).input_ids
    return batch



"""We can apply the data preparation function to all of our training examples using dataset's `.map` method. The argument `num_proc` specifies how many CPU cores to use. Setting `num_proc` > 1 will enable multiprocessing. If the `.map` method hangs with multiprocessing, set `num_proc=1` and process the dataset sequentially."""

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])
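"""(Optional check I'd add here: after the `.map` call the dataset should only
carry the model-ready columns produced in `prepare_dataset`.)"""

print(common_voice["train"].column_names)  # expected: ['input_features', 'labels']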

"""## Training and Evaluation
 
Once we've fine-tuned the model, we will evaluate it on the test data to verify that we have correctly trained it
to translate speech in Darija.

### Load a Pre-Trained Checkpoint

We'll start our fine-tuning run from the pre-trained Whisper `tiny` checkpoint,
the weights for which we need to load from the Hugging Face Hub. Again, this
is trivial through use of 🤗 Transformers!
"""

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model.generation_config.language = "Arabic"
model.generation_config.task = "translate"
model.generation_config.forced_decoder_ids = None
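"""(Optional check, my addition: inspect the decoder prompt that the processor
builds for the Arabic translation task, to confirm which language and task
tokens the model would be prompted with.)"""

# get_decoder_prompt_ids returns (position, token_id) pairs for the forced prompt
prompt_ids = processor.get_decoder_prompt_ids(language="arabic", task="translate")
print(prompt_ids)
print(processor.tokenizer.decode([tok for _, tok in prompt_ids]))  # expect <|ar|><|translate|>...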

"""### Define a Data Collator """



import torch

from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")
        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        # if the bos token was appended in the previous tokenization step,
        # cut it here since it is appended again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

"""Let's initialise the data collator we've just defined:"""

data_collator = DataCollatorSpeechSeq2SeqWithPadding(
    processor=processor,
    decoder_start_token_id=model.config.decoder_start_token_id,
)
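"""(Optional shape check, a minimal sketch assuming the prepared `common_voice`
dataset above: running the collator on two examples should give padded log-Mel
features and labels with padding masked as -100.)"""

sample_batch = data_collator([common_voice["train"][i] for i in range(2)])
print(sample_batch["input_features"].shape)  # e.g. torch.Size([2, 80, 3000]) for whisper-tiny
print(sample_batch["labels"].shape)          # (2, max label length), padding replaced by -100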

"""### Evaluation Metrics

Since this is a translation task, we'll use the BLEU score rather than the
word error rate (WER) that is the de-facto metric for ASR systems. We'll load the BLEU metric from 🤗 Evaluate:
"""

import evaluate

metric = evaluate.load("bleu")

We then define a function that takes our model
predictions and returns the BLEU score. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the BLEU score between the predictions and reference translations:
"""

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # Replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    # Decode predictions and labels
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # Tokenize the outputs for BLEU metric
    pred_str = [p.split() for p in pred_str]
    label_str = [[l.split()] for l in label_str]

    # Calculate BLEU score
    bleu = metric.compute(predictions=pred_str, references=label_str)
    return {"bleu": bleu["bleu"]}

"""### Define the Training Configuration

In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).
"""

from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-tiny-darija-translate",  # change to a repo name of your choice
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,
    warmup_steps=25,
    max_steps=25,
    gradient_checkpointing=False,
    fp16=True,
    evaluation_strategy="steps",
    per_device_eval_batch_size=1,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=5,
    eval_steps=5,
    logging_steps=5,
    report_to=["tensorboard"],
    load_best_model_at_end=True,
    metric_for_best_model="bleu",
    greater_is_better=True,  # BLEU is a higher-is-better metric
    push_to_hub=False,
)


"""**Note**: if one does not want to upload the model checkpoints to the Hub,
set `push_to_hub=False`.

We can forward the training arguments to the 🤗 Trainer along with our model,
dataset, data collator and `compute_metrics` function:
"""

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)


"""We'll save the processor object once before starting training. Since the processor is not trainable, it won't change over the course of training:"""

processor.save_pretrained(training_args.output_dir)

"""### Training
To launch training, simply execute:
"""

trainer.train()
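"""(Optional post-training check I'd consider, sketched under the assumption that
`model.generate` accepts `language`/`task` keyword arguments in the installed
transformers version: generate directly from the fine-tuned model on one
prepared example and inspect the decoded text.)"""

features = torch.tensor(common_voice["train"][0]["input_features"]).unsqueeze(0).to(model.device)
with torch.no_grad():
    generated_ids = model.generate(features, language="ar", task="translate")
print(processor.tokenizer.decode(generated_ids[0], skip_special_tokens=True))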


# Model-card metadata; this would only be used via `trainer.push_to_hub(**kwargs)`
kwargs = {
    "dataset_tags": "darija-c",
    "dataset": "Darija-C",  # a 'pretty' name for the training dataset
    "dataset_args": "config: ar, split: test",
    "language": "ar",
    "model_name": "Whisper tiny darija translate",  # a 'pretty' name for our model
    "finetuned_from": "openai/whisper-tiny",
    "tasks": "automatic-speech-recognition",
}



trainer.save_model("./whisper-tiny-darija-translate")

"""## Building a Demo

Now that we've fine-tuned our model, we can build a demo to show
off its translation capabilities! We'll make use of the 🤗 Transformers
`pipeline`, which takes care of the entire speech-to-text pipeline,
from pre-processing the audio inputs to decoding the
model predictions.

We take the first audio file of our dataset and feed it to
our fine-tuned Whisper model to translate the corresponding speech:
"""

import gradio as gr
from transformers import pipeline, WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer

def test_model():
    model = WhisperForConditionalGeneration.from_pretrained("whisper-tiny-darija-translate")
    processor = WhisperProcessor.from_pretrained("whisper-tiny-darija-translate")
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")
    pipe = pipeline("automatic-speech-recognition", model=model, tokenizer=tokenizer,
                    feature_extractor=processor.feature_extractor)
    audio = fileList[0]
    text = pipe(audio)["text"]
    print("Reference translation : ", sentenceTranslationList[0])
    print("Translation output : ", text)

test_model()

Inference output (screenshot omitted): the pipeline returns English text rather than Arabic script.
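For reference, here is the kind of variant I would also expect to try, sketched under the assumption that the installed transformers version forwards `generate_kwargs` from the ASR pipeline to `model.generate`; it is only a sketch, not a confirmed fix:

# Hypothetical variant: rebuild the pipeline from the saved checkpoint directory
# and force the task and target language at generation time
pipe = pipeline("automatic-speech-recognition", model="whisper-tiny-darija-translate")
print(pipe(fileList[0], generate_kwargs={"task": "translate", "language": "arabic"})["text"])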

I need my model to take an audio file in my custom dialect and output the translation in another language (e.g., Arabic).

Upvotes: 1

Views: 248

Answers (0)
