Triterium

Reputation: 25

How to train a custom speech model in Microsoft Cognitive Services Speech to Text

I'm doing a POC with Speech to Text. I need to recognize specific words like "D-STUM" (daily stand-up meeting). The problem is, every time I tell my program to recognize "D-STUM", I get "Destiny", "This theme", etc.

I already went to speech.microsoft.com/.../customspeech and recorded around 40 wav files of people saying "D-STUM". I've also created a file named "trans.txt" which lists every wav file with the word "D-STUM" after it, like this:

    D_stum_1.wav D-STUM
    D_stum_2.wav D-STUM
    D_stum_3.wav D-STUM
    D_stum_4.wav D-STUM
    ...

Then I uploaded a zip containing the wav files and the trans.txt file, trained a model with that data, and created an endpoint. I referenced this endpoint in my software and launched it.
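To illustrate what I mean by "referencing the endpoint", here is a minimal sketch using the Python Speech SDK (my actual project may differ; the key, region, and endpoint ID are placeholders):

    import azure.cognitiveservices.speech as speechsdk

    # Placeholder credentials -- replace with your own subscription key and region.
    speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")

    # Point the recognizer at the Custom Speech deployment instead of the base model.
    speech_config.endpoint_id = "YOUR_CUSTOM_ENDPOINT_ID"

    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)

    # Single-shot recognition from the default microphone.
    result = recognizer.recognize_once()
    print(result.text)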

I expect my custom speech-to-text model to recognize people saying "D-STUM" and display "D-STUM" as text, but I have never had "D-STUM" displayed, even after customizing the model.

Did I do something wrong? Is this the right way to do custom training? Are 40 samples not enough for the model to be trained properly?

Thank you for your answers.

Upvotes: 1

Views: 337

Answers (1)

Nicolas R

Reputation: 14619

Custom Speech offers several ways to improve recognition of specific words:

  • By providing audio samples with their transcriptions, as you have done
  • By providing text samples (without audio)

Based on my previous use cases, I would strongly suggest creating a text training file with 5 to 10 sentences, each containing "D-STUM" in its usage context, and then duplicating those sentences 10 to 20 times in the file.
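For example, such a text file could look like this (the sentences below are only illustrative; use wording that matches how the term actually comes up for your users), with the whole block repeated 10 to 20 times:

    The D-STUM starts at nine every morning.
    Please join the D-STUM if you have updates for the team.
    We discussed the blocking issue during the D-STUM.
    The D-STUM was shorter than usual today.
    Let's move the D-STUM to the afternoon.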

This approach worked for us to get specific words recognized.

Additionally, if you are using "en-US" or "de-DE" as the target language, you can use a pronunciation file (see the custom pronunciation section of the Custom Speech documentation).
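As far as I remember, a pronunciation file is a tab-delimited text file with the display form on the left and the spoken form (lowercase) on the right; the spoken form below is only a guess at how your users pronounce "D-STUM":

    D-STUM	dee stum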

Upvotes: 1
