I use microsoft-cognitiveservices-speech-sdk (1.38.0) for real-time speech-to-text. The offset is correct when I send the full audio at once, but it is wrong when I send the same audio split into many chunks.
The more audio chunks there are, the more inaccurate the offset becomes.
To reproduce, here is the code I use:
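const fs = require('fs');
const { SpeechConfig, AudioInputStream, AudioConfig, SpeechRecognizer } = require('microsoft-cognitiveservices-speech-sdk');

// Replace <KEY> and <REGION> with your Speech resource credentials.
const speechConfig = SpeechConfig.fromSubscription(<KEY>, <REGION>);
const pushStream = AudioInputStream.createPushStream();
const audioConfig = AudioConfig.fromStreamInput(pushStream);
const speechRecognizer = new SpeechRecognizer(speechConfig, audioConfig);
// Log every final recognition result and any cancellation.
speechRecognizer.recognized = (recognizer, event) => { console.log(event); };
speechRecognizer.canceled = (recognizer, event) => { console.log(event); };
speechRecognizer.startContinuousRecognitionAsync();
// Push the 1443 chunks in order (output_0001.wav to output_1443.wav).
for (let i = 1; i <= 1443; i++) {
  const formattedNumber = i.toString().padStart(4, '0');
  const buffer = fs.readFileSync(`/var/tmp/chunks/output_${formattedNumber}.wav`);
  pushStream.write(buffer);
}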
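To make the reported offsets easier to compare between the full-file run and the chunked run, they can be logged in seconds. The SDK reports `offset` and `duration` on a recognition result in 100-nanosecond ticks. The sketch below is only an illustration: `ticksToSeconds` and `describeResult` are hypothetical helper names, not part of the Speech SDK.

```javascript
// Hypothetical helpers (not part of the Speech SDK).
// The SDK reports result.offset and result.duration in 100-nanosecond ticks.
const TICKS_PER_SECOND = 10000000;

function ticksToSeconds(ticks) {
  return ticks / TICKS_PER_SECOND;
}

// Formats a recognition result as "[start +length] text", with times in seconds.
function describeResult(result) {
  const start = ticksToSeconds(result.offset).toFixed(2);
  const length = ticksToSeconds(result.duration).toFixed(2);
  return `[${start}s +${length}s] ${result.text}`;
}
```

It could be used in the handler as `speechRecognizer.recognized = (recognizer, event) => { console.log(describeResult(event.result)); };`, so that both runs print comparable lines.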
To create the audio chunks:
ffmpeg -i <INPUT_FILE> -f segment -segment_time 0.1 -c copy output_%04d.wav
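Each segment produced this way is a standalone .wav file, so every buffer pushed to the stream contains a RIFF header in addition to the audio samples. A minimal sketch for measuring how many header bytes each chunk carries, assuming the chunks are standard RIFF/WAVE files; `findDataChunk` is a hypothetical helper, not an ffmpeg or Speech SDK function.

```javascript
// Hypothetical helper: walks the RIFF chunks of a WAV buffer and reports how
// many bytes precede the audio samples and how many bytes of samples follow.
function findDataChunk(buffer) {
  const isRiff = buffer.toString('ascii', 0, 4) === 'RIFF';
  const isWave = buffer.toString('ascii', 8, 12) === 'WAVE';
  if (!isRiff || !isWave) {
    throw new Error('not a RIFF/WAVE buffer');
  }
  let pos = 12; // the first sub-chunk follows the 12-byte RIFF header
  while (pos + 8 <= buffer.length) {
    const id = buffer.toString('ascii', pos, pos + 4);
    const size = buffer.readUInt32LE(pos + 4);
    if (id === 'data') {
      // Clamp in case the declared size is larger than the bytes present.
      const dataBytes = Math.min(size, buffer.length - pos - 8);
      return { headerBytes: pos + 8, dataBytes };
    }
    pos += 8 + size + (size % 2); // sub-chunks are padded to an even length
  }
  throw new Error('no data chunk found');
}
```

For one chunk this would be called as `findDataChunk(fs.readFileSync('/var/tmp/chunks/output_0001.wav'))`. It only reports what each file contains; whether those header bytes are related to the offset drift is not established here.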
Here is the audio file: https://drive.google.com/file/d/1H_RJuqMiBaVkpo9XHrgp1bpuFdgQl64O/view?usp=sharing
Thanks for your help.