João-Aroeira

Reputation: 71

Is it possible to transcribe a Twilio call "as you speak"?

Does anyone know whether Twilio can create multiple audio recordings during a call, split on some kind of audio flag or pattern such as silence? That way a callback could fire at the end of each portion of speech and generate text while the call is still in progress.

Thanks!
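The silence-based splitting the question describes can be illustrated with a crude energy threshold over frames of 16-bit PCM samples. This is a minimal sketch, not a Twilio feature: it assumes you already receive audio frames as lists of integers, and `split_on_silence`, `threshold` and `silent_frames_to_end` are hypothetical names with arbitrary default values.

```python
from typing import Callable, Iterable, List


def split_on_silence(
    frames: Iterable[List[int]],
    on_segment: Callable[[List[int]], None],
    threshold: float = 500.0,
    silent_frames_to_end: int = 3,
) -> None:
    """Call on_segment with each run of speech, ended by
    silent_frames_to_end consecutive quiet frames (hypothetical helper)."""
    segment: List[int] = []
    quiet = 0
    for frame in frames:
        # Mean absolute amplitude as a crude loudness measure.
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy >= threshold:
            segment.extend(frame)
            quiet = 0
        elif segment:
            # Keep short pauses inside the segment until enough quiet frames accumulate.
            segment.extend(frame)
            quiet += 1
            if quiet >= silent_frames_to_end:
                on_segment(segment)  # "end of a portion of speech" callback
                segment, quiet = [], 0
    if segment:
        on_segment(segment)  # flush speech still open when the stream ends
```

Each segment passed to `on_segment` could then be sent to a speech-to-text service. A real voice-activity detector would be far more robust than a fixed amplitude threshold.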

Upvotes: 2

Views: 877

Answers (2)

seth machine

Reputation: 1

To get live transcription with Twilio, you would need to use Twilio Media Streams together with a third-party speech-to-text service that supports streaming (infinite) recognition, such as Google Cloud Speech-to-Text. Unfortunately, I don't think there is a native Twilio verb or action that does live speech-to-text or live transcription. You could perhaps run something on iOS, but having a backend server handle this is probably better and more scalable in the long run.

At a high level you need to do the following:

  • Create a WebSocket endpoint to ingest Twilio Media Streams and receive the incoming audio payloads. These payloads are base64-encoded audio of the speech on the call.
  • Send the media stream to a third-party speech-to-text provider, such as Google Cloud.
  • Publish the transcription results to the end user (e.g. via polling through an API or, ideally, a real-time connection such as another WebSocket).
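The first two steps above can be sketched as decoding a single Media Streams message into linear PCM samples. This assumes the message shape Twilio documents for Media Streams (a JSON object whose `event` is `"media"` and whose `media.payload` holds base64-encoded 8 kHz mono μ-law audio); the WebSocket server itself is omitted, and `decode_media_message` and `ulaw_byte_to_pcm16` are hypothetical names.

```python
import base64
import json
from typing import List, Optional


def ulaw_byte_to_pcm16(u: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw: str) -> Optional[List[int]]:
    """Return PCM samples for a 'media' event, or None for other events."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None  # e.g. "connected", "start" or "stop" messages
    ulaw = base64.b64decode(msg["media"]["payload"])
    return [ulaw_byte_to_pcm16(b) for b in ulaw]
```

Some speech-to-text APIs accept μ-law audio directly, in which case the decoded payload bytes can be forwarded without converting them to PCM.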

Twilio itself has published several guides on how to do this:

I spent some time familiarizing myself with these guides and wrote a similar live transcription guide myself, in Java using the Dropwizard framework.

These approaches will work for proof-of-concepts but do not cover areas related to security or scaling of the audio stream processing.

Upvotes: 0

xmjw

Reputation: 3434

Twilio Evangelist here.

So, you could use the timeout attribute on the <Record> verb to get short 'bursts' of spoken text, but the recording may time out while the caller is in the middle of a word, so you would only get half of it! That can make it difficult to decipher what was said, and I would personally not use this approach.

You can end recording on a key-press (a DTMF tone) with the finishOnKey attribute, which may help your needs.
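The two options above can be combined in a single <Record> verb. The sketch below builds that TwiML with the standard library rather than Twilio's helper library; the URLs are placeholders, and the values `timeout="2"` and `finishOnKey="#"` are arbitrary choices for illustration.

```python
import xml.etree.ElementTree as ET


def record_twiml(action_url: str, transcribe_callback_url: str,
                 timeout_s: int = 2, finish_on_key: str = "#") -> str:
    """Build a TwiML <Response> containing one <Record> verb."""
    response = ET.Element("Response")
    ET.SubElement(response, "Record", {
        "action": action_url,                        # requested when the recording ends
        "timeout": str(timeout_s),                   # seconds of silence that end the recording
        "finishOnKey": finish_on_key,                # DTMF key that ends the recording
        "transcribe": "true",                        # ask Twilio to transcribe the recording
        "transcribeCallback": transcribe_callback_url,
    })
    return ET.tostring(response, encoding="unicode")


print(record_twiml("https://example.com/next-record",
                   "https://example.com/transcription"))
```

If the `action` URL responds with this same TwiML again, the caller is recorded in a loop, which yields the chained "bursts" described above, each transcribed separately.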

You cannot currently get a live, or near realtime transcription. You will receive the transcription very quickly, but we only support the timeout and key presses to end a recording and begin transcription.
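When a transcription is ready, Twilio sends a form-encoded POST to the `transcribeCallback` URL. A minimal sketch of reading that body with the standard library follows; the parameter names `CallSid`, `TranscriptionStatus` and `TranscriptionText` are taken from Twilio's callback documentation as I understand it and should be verified, and `handle_transcription_callback` is a hypothetical name.

```python
from typing import Dict
from urllib.parse import parse_qs


def handle_transcription_callback(body: str) -> Dict[str, str]:
    """Extract selected fields from Twilio's form-encoded transcription POST."""
    # parse_qs maps each key to a list of values; keep the first value only.
    params = {key: values[0] for key, values in parse_qs(body).items()}
    return {
        "call_sid": params.get("CallSid", ""),
        "status": params.get("TranscriptionStatus", ""),
        "text": params.get("TranscriptionText", ""),
    }
```

The returned text could then be pushed to whoever is following the call, for example over a WebSocket.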

Hope this helps!

Upvotes: 4
