Azure Speech-to-Text: Poor Transcription Accuracy with Client-Side Audio Buffering and a Server-Side SDK Approach
I am implementing Azure Speech-to-Text in my application using the following approach:
- Client side: audio is recorded in small buffers and sent to the server every few seconds over a WebSocket.
- Server side: the buffers are processed with the Azure Speech SDK to convert speech to text (a simplified sketch of this handler follows).
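For context, the server-side handling is roughly equivalent to the sketch below. It is simplified: the helper name, the temp-file handling, and the key/region placeholders are illustrative, and it assumes the azure-cognitiveservices-speech package. The important point is that each WAV buffer received over the WebSocket is recognized as an independent utterance:

```python
import os
import tempfile

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")


def transcribe_wav_chunk(wav_bytes: bytes) -> str:
    """Called for every WAV buffer received over the WebSocket."""
    # Each chunk is written to a temporary file and recognized in isolation,
    # so the recognizer has no context from the previous or following chunk.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(wav_bytes)
        path = f.name

    audio_config = speechsdk.audio.AudioConfig(filename=path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    result = recognizer.recognize_once()  # single-utterance recognition per chunk
    os.remove(path)

    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        return result.text
    return ""
```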
Issue:
- Transcription accuracy is noticeably worse than with the client-side SDK implementation.
- The last few words in each buffer are often missed or transcribed incorrectly.
- Audio also appears to leak across buffer boundaries (words split between consecutive chunks), which seems to hurt transcription quality.
Observations:
- When the same audio is transcribed with the client-side SDK, accuracy is much better.
- Sending the audio in chunks appears to introduce a delay and a loss of context, leading to hallucinated or incomplete sentences.
- Longer buffers significantly increase latency, which hurts the real-time experience.
Questions:
- How can I improve transcription accuracy with this approach?
- Are there best practices for chunking and streaming audio data to the Azure Speech SDK? (For example, should every chunk be fed into a single push stream with continuous recognition instead of being recognized on its own? See the sketch after this list.)
- Should I use a different encoding or do additional processing before sending the buffers to the server?
- Are there known limitations in the Azure Speech SDK when handling segmented audio streams?
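To make the streaming question concrete: is something like the push-stream pattern below the intended way to handle this, i.e. feeding every incoming chunk into one long-lived PushAudioInputStream with continuous recognition instead of recognizing each buffer separately? This is only a rough sketch of what I mean, assuming raw 16 kHz, 16-bit, mono PCM input and the azure-cognitiveservices-speech package; the function names are illustrative and I have not verified this end to end.

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

# One long-lived push stream per WebSocket session, declared with the raw PCM
# format of the incoming audio (headerless, i.e. no per-chunk WAV headers).
stream_format = speechsdk.audio.AudioStreamFormat(samples_per_second=16000,
                                                  bits_per_sample=16,
                                                  channels=1)
push_stream = speechsdk.audio.PushAudioInputStream(stream_format=stream_format)
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                        audio_config=audio_config)
recognizer.recognizing.connect(lambda evt: print("partial:", evt.result.text))
recognizer.recognized.connect(lambda evt: print("final:  ", evt.result.text))

recognizer.start_continuous_recognition()


def on_chunk_received(pcm_bytes: bytes) -> None:
    """Called from the WebSocket handler for every incoming audio chunk."""
    push_stream.write(pcm_bytes)


def on_connection_closed() -> None:
    """Called when the client disconnects."""
    push_stream.close()                       # flush the remaining audio
    recognizer.stop_continuous_recognition()
```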
Additional Details:
- The client is vanilla JavaScript; the server is Python.
- The WebSocket connection is stable, and no packets are lost during transmission.
- Audio is recorded with the browser's Web Audio API and converted to WAV format before sending.
- I have experimented with different buffer sizes (1 s, 3 s, 5 s) to find the optimal chunk duration.
- The sample rate is adjusted to match Azure's recommended configuration.
- Audio buffers are sent and received in the correct sequence, with no overlap or gaps (see the check sketched below).
- I have inspected the raw audio data and verified that the recording quality is good.
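The sequence and quality check mentioned above was done with a small script along the lines of the sketch below. It is illustrative only and assumes the received chunks have been dumped to numbered WAV files on the server (the "chunks/chunk_*.wav" path is hypothetical):

```python
import glob
import wave

total_seconds = 0.0
for path in sorted(glob.glob("chunks/chunk_*.wav")):
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        total_seconds += duration
        print(f"{path}: {duration:.2f} s, {wav.getframerate()} Hz, "
              f"{wav.getnchannels()} ch, {wav.getsampwidth() * 8}-bit")

# The summed duration matches the length of the original recording, which is
# how I concluded that no audio is being dropped between chunks.
print(f"total audio received: {total_seconds:.2f} s")
```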