Rémi Descamps

Reputation: 31

Use Vosk speech recognition with Python

I'm trying to use Vosk speech recognition in a Python script, but the result is always:

{
  "text" : ""
}

The file itself is not the problem: when I run "vosk-transcriber -l fr -i speech3.wav -o test6.txt" from the Windows command prompt, it works perfectly and I get a test6.txt with an accurate transcription.

Here is my Python code:

import vosk

# Load the Vosk model
model = vosk.Model("voskSmallFr")

# Initialize the recognizer with the model
recognizer = vosk.KaldiRecognizer(model, 16000)

# Sample audio file for recognition
audio_file = "speech3.wav"

# Open the audio file
with open(audio_file, "rb") as audio:
    while True:
        # Read a chunk of the audio file
        data = audio.read(4000)
        if len(data) == 0:
            break
        # Recognize the speech in the chunk
        recognizer.AcceptWaveform(data)

# Get the final recognized result
result = recognizer.FinalResult()
print(result)

I downloaded and tried every French model available on the official Vosk website (4 in total; my wav file is in French). The script runs without errors but returns no text, unlike the command-line tool.

Any ideas? Thank you

Upvotes: 0

Views: 1088

Answers (2)

Rémi Descamps

Reputation: 31

I'm answering my own question to post the final solution to my problem, but the credit goes mainly to Lewis's answer and comments below. Thank you, Lewis! The input .wav file must be 16-bit PCM mono, which can be obtained by running ffmpeg -i "speech3.wav" "outfile.wav" in the Windows command prompt after installing ffmpeg (if the source is stereo, adding -ac 1 to the command downmixes it to mono).

import wave
import json
from vosk import Model, KaldiRecognizer, SetLogLevel


# The .wav file must be 16-bit PCM mono!

def transcribe(wav_file):
    SetLogLevel(0)

    wf = wave.open(wav_file, "rb")

    model = Model(model_path="voskSmallFr", model_name="vosk-model-small-fr-0.22")
    # Use the file's own sample rate instead of a hard-coded value
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)
    rec.SetPartialWords(True)

    text = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        # if silence detected save result
        if rec.AcceptWaveform(data):
            text.append(json.loads(rec.Result())["text"])
    # Flush the buffers and collect whatever remains
    text.append(json.loads(rec.FinalResult())["text"])
    wf.close()

    # Join the non-empty segments into a single string
    return " ".join(segment for segment in text if segment)



print(transcribe("outfile.wav"))
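Since the fix hinges on the file being 16-bit PCM mono, it can help to check the format before handing the file to the recognizer. Here is a minimal sketch using only the standard wave module; the function name check_wav_for_vosk is my own and not part of Vosk. Note that wave.open() itself raises wave.Error for WAV files that are not PCM at all (for example float-encoded ones), so that case shows up as an exception rather than in the returned list.

```python
import wave


def check_wav_for_vosk(path):
    """Return a list of format problems; an empty list means the file looks usable.

    Hypothetical helper (not part of Vosk): it only inspects the WAV header.
    """
    problems = []
    with wave.open(path, "rb") as wf:
        if wf.getnchannels() != 1:
            problems.append(f"expected mono, found {wf.getnchannels()} channels")
        if wf.getsampwidth() != 2:
            problems.append(f"expected 16-bit samples, found {8 * wf.getsampwidth()}-bit")
        if wf.getcomptype() != "NONE":
            problems.append(f"expected uncompressed PCM, found {wf.getcomptype()}")
    return problems
```

Calling check_wav_for_vosk("speech3.wav") before transcribing and printing the returned list should tell you whether the ffmpeg conversion is needed.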

Upvotes: 1

Lewis

Reputation: 832

When silence is detected, AcceptWaveform() returns True and you can retrieve the result with Result(). If it returns False, you can retrieve a partial result with PartialResult(). FinalResult() means the stream has ended: the buffers are flushed and you retrieve whatever remains, which may be empty if the audio ends in silence.

What you could do is:

import json
                
text = []    
with open(audio_file, "rb") as audio:
    while True:
        data = audio.read(4000)
        if len(data) == 0:
            break
        # if silence detected save result
        if recognizer.AcceptWaveform(data):
            text.append(json.loads(recognizer.Result())["text"])
text.append(json.loads(recognizer.FinalResult())["text"])


and you get a list of sentences.
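To make the control flow above easier to follow without a model or an audio file, here is a small sketch of the same loop wrapped in a function. The names collect_sentences and FakeRecognizer are made up for illustration; FakeRecognizer is only a stand-in that mimics the three method names used above (AcceptWaveform, Result, FinalResult) and treats the chunk b"<pause>" as detected silence. A real KaldiRecognizer would be passed in the same way.

```python
import json


def collect_sentences(recognizer, chunks):
    """Feed chunks to the recognizer and gather one text per detected pause."""
    sentences = []
    for data in chunks:
        # True means the recognizer closed an utterance: read it now
        if recognizer.AcceptWaveform(data):
            sentences.append(json.loads(recognizer.Result())["text"])
    # End of stream: flush whatever has not been returned yet
    sentences.append(json.loads(recognizer.FinalResult())["text"])
    # Drop the empty strings produced by trailing silence
    return [s for s in sentences if s]


class FakeRecognizer:
    """Hypothetical stand-in for KaldiRecognizer, for demonstration only."""

    def __init__(self):
        self._words = []

    def AcceptWaveform(self, data):
        if data == b"<pause>":
            return True
        self._words.append(data.decode())
        return False

    def _flush(self):
        text = " ".join(self._words)
        self._words = []
        return json.dumps({"text": text})

    def Result(self):
        return self._flush()

    def FinalResult(self):
        return self._flush()


chunks = [b"bonjour", b"tout", b"<pause>", b"le", b"monde", b"<pause>"]
print(collect_sentences(FakeRecognizer(), chunks))
# prints ['bonjour tout', 'le monde']
```

The final FinalResult() call returns an empty text here because the fake stream ends on a pause, which is why the empty strings are filtered out.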

Edited:

If you want to replicate what I did, here is the code and the audio I used. It worked.

import wave
import json
from vosk import Model, KaldiRecognizer, SetLogLevel

SetLogLevel(0)

wf = wave.open("test.wav", "rb")

model = Model(model_name="vosk-model-en-us-0.22-lgraph")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)
rec.SetPartialWords(True)
                
text = []    
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    # if silence detected save result
    if rec.AcceptWaveform(data):
        text.append(json.loads(rec.Result())["text"])
text.append(json.loads(rec.FinalResult())["text"])

print(f"\n{text}")
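Because the code above calls rec.SetWords(True), each result should also carry per-word details next to "text". The sketch below assumes the result JSON has a "result" list whose entries contain "word", "start", "end" and "conf" keys; the helper name word_timings and the sample string are made up for illustration, so check them against what your own run prints.

```python
import json


def word_timings(result_json):
    """Extract (word, start, end) tuples from one Vosk result string.

    Assumes the "result" list layout described above; returns an empty
    list when the key is missing (for example for an empty utterance).
    """
    payload = json.loads(result_json)
    return [(w["word"], w["start"], w["end"]) for w in payload.get("result", [])]


# Made-up sample in the assumed layout, not real recognizer output
sample = json.dumps({
    "result": [
        {"conf": 1.0, "start": 0.12, "end": 0.45, "word": "hello"},
        {"conf": 0.9, "start": 0.5, "end": 0.9, "word": "world"},
    ],
    "text": "hello world",
})

for word, start, end in word_timings(sample):
    print(f"{start:.2f}-{end:.2f} {word}")
```

In the loop above you could call word_timings(rec.Result()) instead of reading only the "text" field if you need timestamps.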

Upvotes: 1
