Jason Maldonis

Reputation: 337

How can I get the start and end times of words in an audio file with a known transcript using Vosk?

I'm using Vosk (https://alphacephei.com/vosk/) in Python. I want to get the start and end times of every word in an audio file for which I already have the transcript.

I'm using some code I found online to perform speech-to-text using Vosk, and it also gives the start and end times of every word. Unfortunately the transcription isn't perfect.

Since I have the perfect transcript, I want to tell Vosk what the correct transcript is and have it tell me the start and end times of every word. Is this possible?

Here is the code I'm using now:

import wave
import json

from vosk import Model, KaldiRecognizer

model_path = r".\vosk_models\vosk-model-en-us-0.22"
audio_filename = "some_audio_file.wav"

model = Model(model_path)
wf = wave.open(audio_filename, "rb")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)  # Include the start and end times for each word in the output

# get the list of JSON dictionaries
results = []
# recognize speech using vosk model
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        part_result = json.loads(rec.Result())
        results.append(part_result)
part_result = json.loads(rec.FinalResult())
results.append(part_result)

wf.close()  # close audiofile

Upvotes: 5

Views: 2031

Answers (2)

igrinis

Reputation: 13666

A generic speech recognition engine is not supposed to "know" the "perfect" transcription, so there is no direct way to feed this information to it.

That leaves you with two options. The first is to take the recognized word sequence you got from Vosk and match it against the "perfect" transcription using some variant of Levenshtein distance (for example sequence-align). This gives you a correspondence between the words in the "perfect" and recognized sequences, so mapping the timings becomes a trivial task (you will obviously have to deal with misrecognized runs of a few words, but that is nothing really challenging).
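Here is a minimal sketch of that first option, assuming the `results` list built by the code in the question. Note that it uses Python's standard `difflib.SequenceMatcher` rather than a true Levenshtein alignment or the sequence-align package, and the helper names (`flatten_vosk_results`, `map_timings`) are made up for illustration. Words that cannot be matched are left with `None` timings.

```python
from difflib import SequenceMatcher


def normalize(word):
    # Lowercase and drop punctuation so "Fox." can match Vosk's "fox".
    return "".join(ch for ch in word.lower() if ch.isalnum() or ch == "'")


def flatten_vosk_results(results):
    # Each Vosk result dict holds its word entries under "result"
    # (only present when SetWords(True) was set and speech was found).
    return [w for r in results for w in r.get("result", [])]


def map_timings(recognized, transcript_words):
    """Copy start/end times from recognized words onto transcript words.

    recognized: list of dicts with "word", "start", "end" keys.
    Returns a list of (word, start, end); unmatched words get None times.
    """
    ref_tokens = [normalize(w) for w in transcript_words]
    rec_tokens = [normalize(w["word"]) for w in recognized]
    aligned = [(w, None, None) for w in transcript_words]

    matcher = SequenceMatcher(a=ref_tokens, b=rec_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # "equal": identical words. Equal-length "replace": assume each
        # transcript word was misrecognized as exactly one other word.
        if tag == "equal" or (tag == "replace" and same_length):
            for i, j in zip(range(i1, i2), range(j1, j2)):
                hit = recognized[j]
                aligned[i] = (transcript_words[i], hit["start"], hit["end"])
    return aligned


# Hand-written sample in Vosk's word format; "box" is a misrecognized "fox".
sample = [
    {"word": "the", "start": 0.0, "end": 0.2},
    {"word": "quick", "start": 0.2, "end": 0.5},
    {"word": "brown", "start": 0.5, "end": 0.8},
    {"word": "box", "start": 0.8, "end": 1.1},
]
timed = map_timings(sample, "The quick brown fox.".split())
```

With real output you would call `map_timings(flatten_vosk_results(results), transcript.split())`. Runs where the number of recognized words differs from the number of transcript words stay unmatched here; interpolating their times from the neighbouring matched words is one way to fill them in.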

The other option is so-called "forced alignment". This is exactly your case: the "perfect" transcript is known and you need to find its time alignment. You can use the gentle aligner or the Montreal Forced Aligner (example), both of which are built on older Kaldi technology, or a more modern torch-based Wav2Vec model (example).

Upvotes: 3

J.M. Robles

Reputation: 652

Perhaps you could make use of sttcast. It uses Vosk to transcribe audio to an HTML file, from which you can collect the timestamps and the text to correct. I think the task could be automated if you have hundreds of hours of audio, but for only a few hours you should consider doing it manually.


Upvotes: -1
