I recently fine-tuned the Whisper-Tiny model on my custom speech dataset for transcription tasks, and it worked well. However, when I tried to fine-tune the model for a translation task using the same dataset, I ran into some trouble. To start, I only used a small portion of my dataset for the initial fine-tuning to verify that the pipeline was set up correctly before tweaking the training parameters.
The model training completed without any errors, but during inference, the output is always in English, which is not what I need. My goal is to have the output in Arabic script.
I’m seeking advice on what might have gone wrong during the fine-tuning process. Any suggestions or insights would be greatly appreciated.
from datasets import Dataset, DatasetDict
import torchaudio

common_voice = DatasetDict()

# a small toy set: three Darija recordings and their sentences/translations, repeated four times
fileList = ["wav/dar_307.wav", "wav/dar_308.wav", "wav/dar_309.wav"] * 4
sentenceList = ["واش كملتي؟", "مسا الخير", "جفنه"] * 4
sentenceTranslationList = ["هل انتهيت؟", "مسا الخير", "دلو غسيل الملابس"] * 4

def speech_file_to_array_fn(path):
    # load the waveform and return it as a 1-D numpy array together with its sampling rate
    speech_array, sampling_rate = torchaudio.load(path)
    return {"array": speech_array[0].numpy(), "sampling_rate": sampling_rate}

common_voice["train"] = Dataset.from_dict({"path": fileList, "sentence": sentenceList, "translation": sentenceTranslationList,
                                           "audio": [speech_file_to_array_fn(f) for f in fileList]})
common_voice["test"] = Dataset.from_dict({"path": fileList, "sentence": sentenceList, "translation": sentenceTranslationList,
                                          "audio": [speech_file_to_array_fn(f) for f in fileList]})
"""## Prepare Feature Extractor, Tokenizer and Data
We'll go through setting up the feature extractor and tokenizer one by one.
### Load WhisperFeatureExtractor
We load the feature extractor from the pre-trained checkpoint with its default values:
"""
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
"""### Load WhisperTokenizer
"""
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")
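"""As a quick sanity check (my own addition, not part of the original recipe), we can encode one of the target translations and decode it back, to verify that the Arabic text round-trips and to see which special prefix tokens the tokenizer prepends:
"""
input_str = sentenceTranslationList[0]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)
print(f"Input:                 {input_str}")
print(f"Decoded w/ special:    {decoded_with_special}")
print(f"Decoded w/out special: {decoded_str}")
print(f"Are equal:             {input_str == decoded_str}")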
"""### Combine To Create A WhisperProcessor
To simplify using the feature extractor and tokenizer, we can _wrap_
both into a single `WhisperProcessor` class. This processor object
wraps the `WhisperFeatureExtractor` and the `WhisperTokenizer`,
and can be used on the audio inputs and model predictions as required.
In doing so, we only need to keep track of two objects during training:
the `processor` and the `model`:
"""
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")
"""### Prepare Data
Our custom dataset is stored in the `common_voice` DatasetDict. Whisper expects
audio sampled at 16 kHz, so we first cast the `audio` column to that sampling rate:
"""
from datasets import Audio
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
"""Re-loading the first audio sample in the Common Voice dataset will resample
it to the desired sampling rate:
"""
print(common_voice["train"][0])
"""Now we can write a function to prepare our data ready for the model:
1. We load and resample the audio data by calling `batch["audio"]`. As explained above, 🤗 Datasets performs any necessary resampling operations on the fly.
2. We use the feature extractor to compute the log-Mel spectrogram input features from our 1-dimensional audio array.
3. We encode the transcriptions to label ids through the use of the tokenizer.
"""
def prepare_dataset(batch):
    # load the (already resampled) 16 kHz audio data
    audio = batch["audio"]
    # compute log-Mel input features from the 1-D audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode the target translation to label ids
    batch["labels"] = tokenizer(batch["translation"]).input_ids
    return batch
"""We can apply the data preparation function to all of our training examples using dataset's `.map` method. The argument `num_proc` specifies how many CPU cores to use. Setting `num_proc` > 1 will enable multiprocessing. If the `.map` method hangs with multiprocessing, set `num_proc=1` and process the dataset sequentially."""
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"])
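"""For reference (my own addition), the same `.map` call with multiprocessing enabled would look like the commented line below; whether it helps depends on the environment:
"""
# common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=2)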
"""## Training and Evaluation
Once we've fine-tuned the model, we will evaluate it on the test data to verify
that it has learned to translate Darija speech.
### Load a Pre-Trained Checkpoint
We'll start our fine-tuning run from the pre-trained Whisper `tiny` checkpoint,
whose weights we load from the Hugging Face Hub. Again, this is straightforward
with 🤗 Transformers:
"""
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
# generation defaults used at inference time
model.generation_config.language = "Arabic"
model.generation_config.task = "translate"
model.generation_config.forced_decoder_ids = None
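"""As a quick check (my own addition), we can print what the generation config now holds for language and task:
"""
print(model.generation_config.language, model.generation_config.task, model.generation_config.forced_decoder_ids)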
"""### Define a Data Collator """
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any
    decoder_start_token_id: int

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have different lengths and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so these positions are ignored in the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended during tokenization, cut it here,
        # as it is appended again later anyway
        if (labels[:, 0] == self.decoder_start_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
"""Let's initialise the data collator we've just defined:"""
data_collator = DataCollatorSpeechSeq2SeqWithPadding(
processor=processor,
decoder_start_token_id=model.config.decoder_start_token_id,
)
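"""To verify that the collator produces what the model expects (my own addition, not part of the original recipe), we can run it on two prepared examples and inspect the tensor shapes:
"""
sample_batch = data_collator([common_voice["train"][0], common_voice["train"][1]])
print(sample_batch["input_features"].shape)  # (2, 80, 3000) padded log-Mel features
print(sample_batch["labels"].shape)          # (2, longest label length in this mini-batch)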
"""### Evaluation Metrics
Since we are fine-tuning for translation rather than transcription, we'll use the BLEU metric instead of the word error rate (WER) that is typically used to assess ASR systems. For more information, refer to the BLEU [docs](https://huggingface.co/spaces/evaluate-metric/bleu). We'll load it from 🤗 Evaluate:
"""
import evaluate
metric = evaluate.load("bleu")
"""We then simply have to define a function that takes our model
predictions and returns the WER metric. This function, called
`compute_metrics`, first replaces `-100` with the `pad_token_id`
in the `label_ids` (undoing the step we applied in the
data collator to ignore padded tokens correctly in the loss).
It then decodes the predicted and label ids to strings. Finally,
it computes the WER between the predictions and reference labels:
"""
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    # replace -100 with the pad_token_id
    label_ids[label_ids == -100] = tokenizer.pad_token_id

    # decode predictions and labels to strings
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    # tokenize the outputs for the BLEU metric
    pred_str = [p.split() for p in pred_str]
    label_str = [[l.split()] for l in label_str]

    # compute the BLEU score
    bleu = metric.compute(predictions=pred_str, references=label_str)
    return {"bleu": bleu["bleu"]}
"""### Define the Training Configuration
In the final step, we define all the parameters related to training. For more detail on the training arguments, refer to the Seq2SeqTrainingArguments [docs](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainingArguments).
"""
from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
output_dir="whisper-tiny-darija-translate", # change to a repo name of your choice
per_device_train_batch_size=1,
gradient_accumulation_steps=4, # increase by 2x for every 2x decrease in batch size
learning_rate=1e-5,
warmup_steps=25,
max_steps=25,
gradient_checkpointing=False,
fp16=True,
evaluation_strategy="steps",
per_device_eval_batch_size=1,
predict_with_generate=True,
generation_max_length=225,
save_steps=5,
eval_steps=5,
logging_steps=5,
report_to=["tensorboard"],
load_best_model_at_end=True,
metric_for_best_model="bleu",
greater_is_better=False,
push_to_hub=False,
)
"""**Note**: if one does not want to upload the model checkpoints to the Hub,
set `push_to_hub=False`.
We can forward the training arguments to the 🤗 Trainer along with our model,
dataset, data collator and `compute_metrics` function:
"""
from transformers import Seq2SeqTrainer
trainer = Seq2SeqTrainer(
args=training_args,
model=model,
train_dataset=common_voice["train"],
eval_dataset=common_voice["test"],
data_collator=data_collator,
compute_metrics=compute_metrics,
tokenizer=tokenizer,
)
"""We'll save the processor object once before starting training. Since the processor is not trainable, it won't change over the course of training:"""
processor.save_pretrained(training_args.output_dir)
"""### Training
To launch training, simply execute:
"""
trainer.train()
# model-card metadata that would be passed to trainer.push_to_hub(**kwargs); unused here since push_to_hub=False
kwargs = {
"dataset_tags": "darija-c",
"dataset": "Darija-C", # a 'pretty' name for the training dataset
"dataset_args": "config: ar, split: test",
"language": "ar",
"model_name": "Whisper tiny darija translate", # a 'pretty' name for our model
"finetuned_from": "openai/whisper-tiny",
"tasks": "automatic-speech-recognition",
}
trainer.save_model("./whisper-tiny-darija-translate")
"""## Building a Demo
Now that we've fine-tuned our model, we can build a demo to show
off its translation capabilities. We'll make use of the 🤗 Transformers
`pipeline`, which takes care of the entire speech-recognition pipeline,
from pre-processing the audio inputs to decoding the
model predictions.
We take the first audio file of our dataset and feed it to
our fine-tuned Whisper model to produce the corresponding translation:
"""
from transformers import pipeline, WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer

def test_model():
    # reload the fine-tuned model and the saved processor from the output directory
    model = WhisperForConditionalGeneration.from_pretrained("whisper-tiny-darija-translate")
    processor = WhisperProcessor.from_pretrained("whisper-tiny-darija-translate")
    tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-tiny", language="Arabic", task="translate")
    pipe = pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=tokenizer,
        feature_extractor=processor.feature_extractor,
    )
    audio = fileList[0]
    text = pipe(audio)["text"]
    print("Reference translation: ", sentenceTranslationList[0])
    print("Model output:          ", text)

test_model()
I need the model to take an audio file in my custom language (Darija) and output its translation in another language (for example, Arabic).
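The behaviour I'm after is roughly the following (a sketch re-using the pipeline built inside `test_model` above, not standalone code):

# desired: Darija audio in -> Arabic translation out
text = pipe("wav/dar_307.wav")["text"]
print(text)  # I expect something like "هل انتهيت؟", but I currently get English text instead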