Reputation: 31
I'm trying to use Vosk speech recognition in a Python script, but the result is always:
{
  "text" : ""
}
The problem isn't my file: when I run "vosk-transcriber -l fr -i speech3.wav -o test6.txt" from the Windows command prompt, it works perfectly and I get a test6.txt with an accurate transcription.
Here is my Python script:
import vosk

# Load the Vosk model
model = vosk.Model("voskSmallFr")
# Initialize the recognizer with the model
recognizer = vosk.KaldiRecognizer(model, 16000)
# Sample audio file for recognition
audio_file = "speech3.wav"
# Open the audio file
with open(audio_file, "rb") as audio:
    while True:
        # Read a chunk of the audio file
        data = audio.read(4000)
        if len(data) == 0:
            break
        # Recognize the speech in the chunk
        recognizer.AcceptWaveform(data)
# Get the final recognized result
result = recognizer.FinalResult()
print(result)
I downloaded and tried every French model available on the official Vosk website (four in total; my WAV file is in French). The script runs without errors but returns no text, unlike the Windows command.
Any ideas? Thank you.
Upvotes: 0
Views: 1088
Reputation: 31
I'm answering my own question to post the final solution to my problem, but it's mainly thanks to Lewis's answer and comments below.
Thank you, Lewis!
The input .wav file must be 16-bit PCM mono, which can be obtained by running "ffmpeg -i speech3.wav outfile.wav" in the Windows command prompt after installing ffmpeg. If the source is stereo, adding "-ac 1" to the command forces mono output.
import wave
import json
from vosk import Model, KaldiRecognizer, SetLogLevel

# The .wav file must be 16-bit PCM mono!
def vosk(wavFile):
    SetLogLevel(0)
    wf = wave.open(wavFile, "rb")
    model = Model(model_path="voskSmallFr", model_name="vosk-model-small-fr-0.22")
    # Use the file's own sample rate instead of hard-coding one
    rec = KaldiRecognizer(model, wf.getframerate())
    rec.SetWords(True)
    rec.SetPartialWords(True)
    text = []
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        # If silence is detected, save the result
        if rec.AcceptWaveform(data):
            text.append(json.loads(rec.Result())["text"])
    text.append(json.loads(rec.FinalResult())["text"])
    wf.close()
    # Join the non-empty segments into one string
    return " ".join(t for t in text if t)

print(vosk("outfile.wav"))
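For anyone who wants to check a file before feeding it to Vosk, here is a minimal sketch that uses only the standard wave module. The helper names describe_wav and is_pcm16_mono are made up for this illustration and are not part of Vosk; the checks simply mirror the "16-bit PCM mono" requirement above.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, frame_rate, compression_type)."""
    with wave.open(path, "rb") as wf:
        return (wf.getnchannels(), wf.getsampwidth(),
                wf.getframerate(), wf.getcomptype())


def is_pcm16_mono(path):
    """True if the file is uncompressed PCM, 16-bit (2 bytes per sample), mono."""
    try:
        channels, width, _rate, comptype = describe_wav(path)
    except wave.Error:
        # The wave module raises wave.Error for formats it cannot read
        return False
    return channels == 1 and width == 2 and comptype == "NONE"
```

If is_pcm16_mono("speech3.wav") returns False, converting the file with ffmpeg first is probably the thing to try. The frame rate reported by describe_wav is also the value to pass to KaldiRecognizer.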
Upvotes: 1
Reputation: 832
When silence is detected, AcceptWaveform() returns True and you can retrieve the result with Result(). If it returns False, you can retrieve a partial result with PartialResult(). FinalResult() means the stream has ended: the buffers are flushed and you retrieve the remaining result, which could be silence.
What you could do is
import json

text = []
with open(audio_file, "rb") as audio:
    while True:
        data = audio.read(4000)
        if len(data) == 0:
            break
        # if silence detected save result
        if recognizer.AcceptWaveform(data):
            text.append(json.loads(recognizer.Result())["text"])
text.append(json.loads(recognizer.FinalResult())["text"])
and you get a list of sentences.
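To make the three calls easier to compare, here is a rough sketch of the same loop wrapped in a function. The function name transcribe and the on_partial callback are assumptions made for this illustration; rec is expected to behave like a KaldiRecognizer and wf like an object returned by wave.open().

```python
import json


def transcribe(rec, wf, frames_per_chunk=4000, on_partial=None):
    """Feed audio from wf to rec and return the recognized text as one string."""
    segments = []
    while True:
        data = wf.readframes(frames_per_chunk)
        if len(data) == 0:
            break
        if rec.AcceptWaveform(data):
            # An utterance ended (silence detected): keep its text
            segments.append(json.loads(rec.Result())["text"])
        elif on_partial is not None:
            # Still mid-utterance: PartialResult() holds the text so far
            on_partial(json.loads(rec.PartialResult())["partial"])
    # End of stream: flush the buffers and keep whatever remains
    segments.append(json.loads(rec.FinalResult())["text"])
    # Skip empty segments (for example trailing silence)
    return " ".join(s for s in segments if s)
```

Passing on_partial=print would show the text as it builds up, while the return value contains only the finalized segments.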
Edited:
If you want to try to replicate what I did, here is the code and the audio I used. It worked.
import wave
import json
from vosk import Model, KaldiRecognizer, SetLogLevel

SetLogLevel(0)
wf = wave.open("test.wav", "rb")
model = Model(model_name="vosk-model-en-us-0.22-lgraph")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)
rec.SetPartialWords(True)
text = []
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    # if silence detected save result
    if rec.AcceptWaveform(data):
        text.append(json.loads(rec.Result())["text"])
text.append(json.loads(rec.FinalResult())["text"])
print(f"\n{text}")
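Since the code above calls SetWords(True), each Result() string should also carry a "result" list with per-word entries ("word", "start", "end", "conf"). A small sketch for pulling those out, assuming that layout; word_timings is a hypothetical helper name, not a Vosk function:

```python
import json


def word_timings(result_json):
    """Return (word, start, end) tuples from a Vosk result JSON string."""
    result = json.loads(result_json)
    # "result" is absent when nothing was recognized, so default to an empty list
    return [(w["word"], w["start"], w["end"]) for w in result.get("result", [])]
```

It could be called as word_timings(rec.Result()) inside the loop, in place of reading only the "text" field.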
Upvotes: 1