Aaron B

Reputation: 173

Watson Nodejs Speech To Text - train language model

I have been using watson-speech@^0.7.5 to generate subtitles for streaming videos (HLS) for sporting customers.

Further, I have been able to train language models.

I would like to use recognizeElement and my customization_id for my trained language model. However, I have 2 problems:

1) I think recognizeElement has been deprecated

The library call I am using is

videoProps.stream = WatsonSpeechToText.recognizeElement({
      element: myMediaElement,
      token: videoProps.ctx.token,
      muteSource: false,
      autoPlay: false,
      model:videoProps.ctx.currentModel,
      timestamps: true,
      profanity_filter: true,
      inactivity_timeout: -1,
      continuous: true
    })
    .pipe(new WatsonSpeechToText.FormatStream());

However, I noticed that in watson-speech 0.19.3 the API has been removed. Is there an alternative now?

Also, I would like to use a custom language model I trained. Will this API be updated to include the following call?

videoProps.stream = WatsonSpeechToText.recognizeElement({
      element: myMediaElement,
      token: videoProps.ctx.token,
      muteSource: false,
      autoPlay: false,
      customization_id: videoProps.ctx.currentModel.replace('custom:',''),
      timestamps: true,
      profanity_filter: true,
      inactivity_timeout: -1,
      continuous: true
    })
    .pipe(new WatsonSpeechToText.FormatStream());

2) I do not think the API supports customization_ids.

While looking in recognize-stream.js, I noticed that neither OPENING_MESSAGE_PARAMS_ALLOWED nor QUERY_PARAMS_ALLOWED includes customization_id.

I could certainly pull down the source and make the changes, but again, recognizeElement is gone.

Thanks, Aaron.

Upvotes: 2

Views: 523

Answers (1)

Nathan Friedly

Reputation: 8166

I sent you an email with a few other details, but I'll go ahead and copy the important parts here in case anyone else has the same question:

I removed recognizeElement() in v0.15 for a few reasons:

  • Reduced transcription quality - the audio goes through a couple of extra conversion steps which led to lower quality transcriptions than other methods of transcribing a given source

  • Inconsistent output - due to browser quirks, the raw audio stream will differ slightly from one playback to another, leading to subtly different transcriptions in some cases. This made the STT service appear to be inconsistent.

  • Oddities with pause/fast forward/rewind - the transcription is for the audio as it's heard coming out of the speakers, which means that rewinding will get repeated words, pausing could cause a word to be split in half, etc. Extended pauses or periods of silence can also cause a transcription timeout.

My recommended solution is to perform the transcription server-side: use ffmpeg to extract and convert the audio, then reformat the results into WebVTT format and attach them as a subtitles track on the video. It's more work, but it produces significantly better results.
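As a rough sketch of that server-side pipeline: the function names below are illustrative, and the result shape assumed is the standard Watson STT JSON produced with timestamps enabled. Audio would first be extracted from the HLS stream with something like ffmpeg -i playlist.m3u8 -vn -ac 1 -ar 16000 audio.wav, fed to the service, and the JSON response converted to WebVTT:

```javascript
// Assumed Watson STT result shape (timestamps: true):
//   { results: [{ alternatives: [{ transcript, timestamps: [[word, start, end], ...] }] }] }

// Format seconds as a WebVTT timestamp (HH:MM:SS.mmm).
function toVttTime(seconds) {
  const h = String(Math.floor(seconds / 3600)).padStart(2, '0');
  const m = String(Math.floor((seconds % 3600) / 60)).padStart(2, '0');
  const s = (seconds % 60).toFixed(3).padStart(6, '0');
  return h + ':' + m + ':' + s;
}

// Turn each result into one WebVTT cue, using the first and last
// word timestamps as the cue's start and end times.
function resultsToVtt(sttJson) {
  const cues = sttJson.results.map(function (result, i) {
    const alt = result.alternatives[0];
    const start = alt.timestamps[0][1];
    const end = alt.timestamps[alt.timestamps.length - 1][2];
    return (i + 1) + '\n' + toVttTime(start) + ' --> ' + toVttTime(end) +
      '\n' + alt.transcript.trim();
  });
  return 'WEBVTT\n\n' + cues.join('\n\n') + '\n';
}
```

The generated .vtt file can then be served alongside the video and attached client-side with a track element (kind="subtitles") on the video tag.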

I've asked the Speech team about adding WebVTT as an output format to simplify this, but I don't know if/when it's going to happen.

Update: If you really want to use the old recognizeElement() method with a current release of the SDK, I brought it back as an example: https://github.com/watson-developer-cloud/speech-javascript-sdk/tree/master/examples/static/audio-video-deprecated

To answer the second question, a customization_id is now accepted as of v0.20. Note that the public STT service does not currently support customization, though.

Upvotes: 1
