Reputation: 25
I'm doing a POC with speech-to-text. I need to recognize specific words like "D-STUM" (daily stand-up meeting). The problem is that every time I ask my program to recognize "D-STUM", I get "Destiny", "This theme", etc.
I already went to speech.microsoft.com/.../customspeech, and I've recorded around 40 wav files of people saying "D-STUM". I've also created a file named "trans.txt" which lists every wav file, followed by the word "D-STUM". Like this:
D_stum_1.wav D-STUM
D_stum_2.wav D-STUM
D_stum_3.wav D-STUM
D_stum_4.wav D-STUM
...
Then I uploaded a zip containing the wav files and the trans.txt file, trained a model with that data, and created an endpoint. I referenced this endpoint in my application and launched it.
I expect my custom speech-to-text model to recognize people saying "D-STUM" and to display "D-STUM" as text. However, "D-STUM" has never been displayed, even after customizing the model.
Did I do something wrong? Is this the right way to do custom training? Are 40 samples not enough for the model to be properly trained?
Thank you for your answers.
Upvotes: 1
Views: 337
Reputation: 14619
Custom Speech offers several ways to improve recognition of specific words.
Based on my previous use cases, I would strongly suggest creating a training file with 5 to 10 sentences in it, each one containing "D-STUM" in its usage context. Then duplicate those sentences 10 to 20 times in the file.
This approach worked for us when we needed specific words to be recognized.
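To make the suggestion concrete, here is a minimal Python sketch that builds such a text file. The sentences, the function name build_training_text, and the repeat count of 15 are my own illustrative assumptions, not anything prescribed by Custom Speech; adapt them to how "D-STUM" is really used in your team.

```python
# Sketch: build a plain-text training file in which each sentence
# contains the target word in a realistic context, then repeat the
# whole set several times (the answer suggests 10 to 20 repetitions).

# Hypothetical example sentences; replace them with your own.
SENTENCES = [
    "The D-STUM starts at nine thirty.",
    "Who is leading the D-STUM today?",
    "Please add that topic to tomorrow's D-STUM.",
    "I will be late for the D-STUM.",
    "We discussed the release during the D-STUM.",
]


def build_training_text(sentences, repeats=15):
    """Return the sentences repeated `repeats` times, one per line."""
    lines = []
    for _ in range(repeats):
        lines.extend(sentences)
    return "\n".join(lines) + "\n"
```

You could then write the result to disk with something like pathlib.Path("training_sentences.txt").write_text(build_training_text(SENTENCES), encoding="utf-8") and upload that file as text data for the model (the file name is arbitrary).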
Additionally, if you are using "en-US" or "de-DE" as the target language, you can also use a pronunciation file, see here
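As a rough illustration of what a pronunciation entry could look like, the sketch below assumes the file holds one entry per line, with the displayed form and the spoken form separated by a tab, and the spoken form written in lowercase. Both that layout and the spelling "dee stum" are assumptions on my part; check the pronunciation documentation for the exact rules before uploading. The helper name pronunciation_line is hypothetical.

```python
# Sketch: format one pronunciation entry as "<display form>\t<spoken form>".
# The tab-separated layout and the lowercase spoken form are assumptions.

def pronunciation_line(display_form, spoken_form):
    """Return a single tab-separated pronunciation entry."""
    if "\t" in display_form or "\t" in spoken_form:
        raise ValueError("fields must not contain tab characters")
    return f"{display_form}\t{spoken_form.lower()}"


# Hypothetical spoken form for the word from the question.
entry = pronunciation_line("D-STUM", "dee stum")
```

Each such line would go into a plain-text file, which is then uploaded alongside the other training data.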
Upvotes: 1