Reputation: 23
I'm trying to build an application that performs speaker diarization using the Microsoft Cognitive Services Speaker Recognition APIs.
Looking at the sample project and reading the API documentation, I understood that recognition is done by sending a WAV file to the service, which goes against my goal of doing it in real time.
Has anyone looked into this? Is it feasible with these APIs, or should I look for another approach?
Upvotes: 2
Views: 1735
Reputation: 1287
There is no streaming approach like Google offers with its Speech API. Enrolling a new profile does not actually need 30 seconds; in my recent practice I had successful results with roughly 10 seconds. The core issue with the MS API is its handling of multiple speakers: you have to find your own way to separate them into distinct audio tracks, otherwise it will only recognize the very first known voice.
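For reference, a minimal enrollment sketch against the Speaker Recognition REST surface, assuming the legacy `spid/v1.0` endpoints, a `westus` resource, and a short mono 16 kHz PCM WAV clip; the ~10-second clip length reflects my experience above, not a documented guarantee:

    import requests

    REGION = "westus"                              # assumption: your resource's region
    BASE = f"https://{REGION}.api.cognitive.microsoft.com/spid/v1.0"
    KEY = "<your-subscription-key>"                # assumption: key from the Azure portal

    headers_json = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/json"}
    headers_wav = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/octet-stream"}

    # 1) Create an identification profile for one speaker.
    resp = requests.post(f"{BASE}/identificationProfiles",
                         headers=headers_json, json={"locale": "en-US"})
    resp.raise_for_status()
    profile_id = resp.json()["identificationProfileId"]

    # 2) Enroll that profile with a short clip (~10 s worked for me with shortAudio).
    with open("speaker_a_enroll.wav", "rb") as f:  # hypothetical file name
        resp = requests.post(
            f"{BASE}/identificationProfiles/{profile_id}/enroll",
            params={"shortAudio": "true"}, headers=headers_wav, data=f)
    resp.raise_for_status()
    print("enrollment accepted, poll:", resp.headers.get("Operation-Location"))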
Upvotes: 0
Reputation: 25220
Enrollment needs 30 seconds of data. Once you have a user profile you can identify users from 1-second samples, so you can do it almost in real time with a very small delay. To use this you need to set the shortAudio parameter. It's hard to imagine identification working much faster than that.
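A rough sketch of that near-real-time loop, assuming the legacy `spid/v1.0` identify endpoint and previously enrolled profile IDs; capturing and chunking the live audio is left out, and identification itself is asynchronous, so you poll the returned Operation-Location:

    import time
    import requests

    REGION = "westus"                              # assumption: your resource's region
    BASE = f"https://{REGION}.api.cognitive.microsoft.com/spid/v1.0"
    KEY = "<your-subscription-key>"                # assumption
    PROFILE_IDS = "<profileId1>,<profileId2>"      # previously enrolled speakers

    headers = {"Ocp-Apim-Subscription-Key": KEY, "Content-Type": "application/octet-stream"}

    def identify_chunk(wav_bytes):
        """Submit one short audio chunk with shortAudio=true and poll for the result."""
        resp = requests.post(
            f"{BASE}/identify",
            params={"identificationProfileIds": PROFILE_IDS, "shortAudio": "true"},
            headers=headers, data=wav_bytes)
        resp.raise_for_status()
        op_url = resp.headers["Operation-Location"]
        while True:
            status = requests.get(op_url, headers={"Ocp-Apim-Subscription-Key": KEY}).json()
            if status.get("status") in ("succeeded", "failed"):
                return status
            time.sleep(0.5)                        # small polling delay

    with open("chunk_1s.wav", "rb") as f:          # hypothetical 1-second chunk
        print(identify_chunk(f.read()))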
In case you need something different, there are open-source speech toolkits like Kaldi which can do more flexible things.
Upvotes: 1