Pawan Kumar

Reputation: 1533

Offline Speech Recognition in browser

I am working on a product that needs to take input from users and perform certain actions based on it. We have implemented this as a chat box driven by typing, and it serves our purpose. For a future release we want to add voice recognition to the chat window. We thought of using

window.SpeechRecognition || window.webkitSpeechRecognition

but we came to know that the functionality available in browsers uses Google's Cloud Speech API. As we deal with very sensitive user information, this would be a security issue. Are there any alternatives for implementing speech recognition that work in any browser?
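
For reference, a minimal sketch of the usage we had in mind (standard Web Speech API events):

    const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
    const recognition = new SpeechRecognition();
    recognition.lang = 'en-US';
    recognition.interimResults = false;

    recognition.onresult = (event) => {
      const transcript = event.results[0][0].transcript;
      console.log('Heard:', transcript); // this would feed our chat box
    };
    recognition.onerror = (event) => console.error(event.error);

    recognition.start(); // in Chrome this streams audio to Google's servers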

Upvotes: 10

Views: 15763

Answers (5)

Gabriel Grant

Reputation: 5591

There isn't a great answer to this, but your best bet for offline speech recognition at the moment (August 2023) is an implementation of OpenAI's Whisper model compiled to WebAssembly. There are three that I know of:

  1. ggerganov's whisper.cpp
  2. xenova implementation on transformers.js
  3. HuggingFace's implementation on Candle

Note this still isn't a great option for a few reasons:

  1. download size: because it isn't built into the browser, it requires the browser to download a large model file (absolute minimum of 31 MB for the quantized "tiny" model)
  2. quality: there's a pretty direct tradeoff between model size and quality. Only the tiniest quantized models are even close to reasonable to load in most webpages, and you're not going to get top-notch results with them. Even the largest models you can plausibly load into a browser (likely "small", maybe quantized "medium", or, if you're really brave/masochistic, quantized "large"... but it's over 1 GB) aren't going to be as good as the large unquantized models that can only reasonably run on a server. And even if you do get these bigger models loaded, quality will still fall short.
  3. inference speed: the tiny model can just barely keep up with real-time transcription on a relatively new/powerful laptop or desktop (it doesn't quite keep up on my older X1 Carbon gen 7 laptop). It will likely lag significantly on most mobile devices, and larger models will be even slower. This, for me, is the biggest problem. Try it out for yourself with ggerganov's stream demo.
  4. complexity: getting any of these up and running in your own project is not entirely straightforward, and it's generally much lower level than the Web Speech Recognition API. For example, the core of the transformers.js implementation, which seems to be the simplest, is over 100 lines of code (and that only handles pre-recorded files, not real-time transcription); see the sketch after this list.
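
To give a rough sense of that simplest route, here's a minimal transformers.js sketch for pre-recorded audio (the model name is one of xenova's published quantized checkpoints; verify against the project's current docs):

    import { pipeline } from '@xenova/transformers';

    // Downloads the quantized "tiny" English model on first run (tens of MB),
    // then serves it from the browser cache afterwards.
    const transcriber = await pipeline(
      'automatic-speech-recognition',
      'Xenova/whisper-tiny.en'
    );

    // Accepts a URL or a Float32Array of 16 kHz mono samples.
    const { text } = await transcriber('https://example.com/audio.wav');
    console.log(text);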

Part of the added complexity is because these types of models generally work in chunks. For longer audio files, and especially for real-time transcription, we want a continuous stream of audio to produce a continuous stream of output text. The Web Speech Recognition API handles that for you, while with Whisper you have to do the chunking yourself (and deal with things like window overlaps or corrected transcriptions of already-seen words).
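
As an illustration (emphatically not production code), a naive version of that chunking might look like the sketch below; transcribe() is a placeholder for whichever Whisper wrapper you use, and deduplicating the re-transcribed overlap is left out:

    // Naive chunked streaming sketch. transcribe(Float32Array) stands in for
    // your Whisper wrapper (whisper.cpp WASM, transformers.js, etc.).
    const CHUNK_SECONDS = 5;
    const OVERLAP_SECONDS = 1;

    async function startStreaming(transcribe) {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
      const ctx = new AudioContext({ sampleRate: 16000 }); // Whisper expects 16 kHz mono
      const source = ctx.createMediaStreamSource(stream);
      // ScriptProcessorNode is deprecated but keeps the sketch short;
      // use an AudioWorklet in real code.
      const processor = ctx.createScriptProcessor(4096, 1, 1);

      let samples = new Float32Array(0);
      processor.onaudioprocess = (e) => {
        const input = e.inputBuffer.getChannelData(0);
        const merged = new Float32Array(samples.length + input.length);
        merged.set(samples);
        merged.set(input, samples.length);
        samples = merged;

        if (samples.length >= CHUNK_SECONDS * ctx.sampleRate) {
          const audioWindow = samples;
          // Keep a tail so words cut off at the boundary are seen again in
          // the next window (merging the duplicated words is up to you).
          samples = samples.slice(samples.length - OVERLAP_SECONDS * ctx.sampleRate);
          transcribe(audioWindow).then((text) => console.log(text));
        }
      };

      source.connect(processor);
      processor.connect(ctx.destination);
    }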

There is a good description of some of these issues with using lower-level speech recognition model APIs in the README of Google's Open Source Live Transcribe Speech Engine.[1]

All that to say, it would be really nice if we could just use the Web Speech Recognition API itself, with an offline browser-native model, but I haven't seen any recent movement in that direction.[2][3] If you can accept the limitations, Whisper might be a workable alternative (and if you want to make a Web Speech API polyfill, I'm sure it would be very much appreciated!)


[1]: In the announcement post for that library, Google recognized the complications of using an online system. Unfortunately, despite the name, this project isn't actually what I'd call a "Live Transcribe Speech Engine", but rather a library for doing live transcription against Google's cloud transcription API.

[2]: in fact, Chrome does ship a library for offline transcription called SODA (Speech On-Device API), but it was initially released for the Live Caption feature and still doesn't seem to be used for user-facing speech-to-text. Not so surprisingly, "the Speech team was concerned about unauthorized repurposing of their components", so I'd guess general availability for speech-to-text usage isn't something we can expect in the near future.

[3]: At one point Mozilla was building a speech-to-text engine called DeepSpeech to embed in Firefox, but apparently dropped development. Some former members of the DeepSpeech team forked the project and continued the work for a while as Coqui AI STT, but they have since retired that effort and recommend using Whisper instead.

Upvotes: 6

john swana

Reputation: 53

Use a TensorFlow.js ("tfjs") model; it's the most sensible solution that works in the browser.

The Speech Command Recognizer is a JavaScript module that enables recognition of spoken commands comprised of simple, isolated English words from a small vocabulary.
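
A minimal sketch with the @tensorflow-models/speech-commands package (check the repo's README for the current API):

    import * as tf from '@tensorflow/tfjs';
    import * as speechCommands from '@tensorflow-models/speech-commands';

    // 'BROWSER_FFT' uses the browser's native WebAudio FFT.
    const recognizer = speechCommands.create('BROWSER_FFT');
    await recognizer.ensureModelLoaded();

    const labels = recognizer.wordLabels(); // e.g. 'yes', 'no', 'up', 'down', ...

    recognizer.listen(async (result) => {
      // result.scores holds one probability per label.
      const scores = Array.from(result.scores);
      const best = labels[scores.indexOf(Math.max(...scores))];
      console.log('Command:', best);
    }, { probabilityThreshold: 0.75 });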

Upvotes: 4

user6269864

Reputation:

Apparently PocketSphinx.js is the only option available as of now. It's an open-source speech-to-text engine that supports English, but not many languages beyond that.

Github:

However, if you want to run your code on a single Android device (e.g. one displayed somewhere in a public area), you can use "Download offline voice recognition language" in mobile Chrome's settings. There is no such option in the desktop browser.

Upvotes: 3

gdm

Reputation: 7938

You can try:

  • Snowboy: no WAV files are stored on the server. They train a neural network for you, and you can download the model's weights.
  • TensorFlow: it's really great, but it requires a bit of work on your side. Successful STT projects built on it include DeepSpeech and related efforts (see the sketch below).
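
Note that DeepSpeech doesn't run in the browser itself, so it would sit behind your own server (audio never leaves your infrastructure). A minimal Node sketch with the deepspeech npm package, assuming you've downloaded the 0.9.x model files from the project's releases page:

    const DeepSpeech = require('deepspeech');
    const fs = require('fs');

    // Model and scorer files are downloaded separately from the releases page.
    const model = new DeepSpeech.Model('deepspeech-0.9.3-models.pbmm');
    model.enableExternalScorer('deepspeech-0.9.3-models.scorer');

    // stt() expects raw 16-bit, 16 kHz, mono PCM samples.
    const audio = fs.readFileSync('utterance.raw');
    console.log(model.stt(audio));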

Upvotes: 4

Daniel Bolanos

Reputation: 795

You can try IBM Watson's Speech To Text service. It can be used from any browser, and you can opt out so that users' data is not logged server-side: https://console.bluemix.net/docs/services/watson/getting-started-logging.html#controlling-request-logging-for-watson-services

The demo of the service is here: https://speech-to-text-demo.ng.bluemix.net/

It works at least in Firefox and Chrome, and it is based on the following open-source SDK: https://github.com/watson-developer-cloud/speech-javascript-sdk
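
A rough sketch with that SDK (the token endpoint here is your own server, and option names should be verified against the SDK's README):

    // The access token must be generated server-side with your Watson credentials.
    const accessToken = await fetch('/api/speech-to-text/token').then((r) => r.text());

    const stream = WatsonSpeech.SpeechToText.recognizeMicrophone({
      accessToken,
      objectMode: false, // emit plain text instead of result objects
    });

    stream.setEncoding('utf8');
    stream.on('data', (text) => console.log(text));
    stream.on('error', (err) => console.error(err));
    // Opting out of data logging is done per request with the
    // X-Watson-Learning-Opt-Out header described in the linked docs.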

PS: for the general case, when users' data is not sensitive, it is better not to opt out, so that Watson can leverage the data to improve service quality.

Upvotes: 0
