Reputation: 41
I have recently been working with CMU's sphinx4 for transcription and, eventually, forced alignment, i.e. aligning audio with its transcript.
I found a project called AutoCap that does essentially what I want to build. I installed it, but it did not work; after some tweaking, all I obtained were incorrect timestamps.
So I decided to give sphinx4 a go myself. I successfully transcribed a wav file using Sphinx's Transcriber.jar, but I could not get it working for audio containing non-digit data. The readme states that 'people who want to transcribe non-digits data should modify the config.xml file to use the correct grammar, language model, and linguist to do so'.
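For reference, here is a rough sketch of how I understand the relevant parts of config.xml would have to change for non-digit data: swap the digits grammar for a word-level n-gram language model and point a LexTreeLinguist at it. The component names and property names below are my own guesses based on the standard sphinx4 classes, and the file paths are placeholders, so please check them against the config.xml shipped with the Transcriber demo:

    <!-- Sketch only: the dictionary, language model, and linguist components
         that would replace the digits grammar. Property names are approximate
         and paths are placeholders; compare with the Transcriber demo config. -->

    <!-- full English pronunciation dictionary instead of the digits dictionary -->
    <component name="dictionary"
               type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
        <property name="dictionaryPath" value="file:models/dict/cmudict.0.6d"/>
        <property name="fillerPath"     value="file:models/dict/fillerdict"/>
    </component>

    <!-- word-level trigram language model instead of the digits grammar -->
    <component name="trigramModel"
               type="edu.cmu.sphinx.linguist.language.ngram.large.LargeTrigramModel">
        <property name="location"   value="file:models/lm/en-us.lm.dmp"/>
        <property name="dictionary" value="dictionary"/>
        <property name="maxDepth"   value="3"/>
    </component>

    <!-- linguist that searches the n-gram model;
         "wsj" refers to the acoustic model component already defined in the demo config -->
    <component name="lexTreeLinguist"
               type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
        <property name="acousticModel" value="wsj"/>
        <property name="languageModel" value="trigramModel"/>
        <property name="dictionary"    value="dictionary"/>
    </component>

    <!-- the search manager must point at the new linguist -->
    <component name="searchManager"
               type="edu.cmu.sphinx.decoder.search.SimpleBreadthFirstSearchManager">
        <property name="linguist" value="lexTreeLinguist"/>
    </component>

Is this roughly the right direction, or am I missing something?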
So, can anyone help me with either of these, i.e. getting AutoCap to produce correct timestamps, or configuring sphinx4's config.xml (grammar, language model, and linguist) to transcribe non-digit data?
Thanks.
Upvotes: 4
Views: 1634
Reputation: 1486
I am currently working on the same issue, i.e. transcribing non-digit data. I have looked briefly into the sphinx4 programmer's guide and used the language models, acoustic models, and JSGF grammar as suggested, but the results were not up to the mark. I believe that merely tweaking parameters or changing config.xml alone will not suffice; we would need a home-grown algorithm on top of sphinx4 to get better recognition. On my side, I have used the LexTreeLinguist, JSGFGrammar, and the trigram language model, but the results were not great, perhaps because the audio input was not exactly American English. I will work on it a bit more and let you know my results.
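One thing worth noting: as far as I can tell, the JSGFGrammar is meant to be paired with the FlatLinguist, while the LexTreeLinguist expects an n-gram language model, so mixing the two may be part of the problem. A rough sketch of the grammar-based wiring (the component names, grammar name, and paths are placeholders of mine, and the JSGFGrammar package differs between sphinx4 releases, so verify against the demo configs):

    <!-- Sketch of the grammar route: a JSGF grammar compiled by the FlatLinguist.
         Names and paths below are placeholders; check them against the demo configs. -->
    <component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
        <property name="grammarLocation" value="file:grammars/"/>
        <property name="grammarName"     value="commands"/> <!-- i.e. grammars/commands.gram -->
        <property name="dictionary"      value="dictionary"/>
    </component>

    <component name="flatLinguist" type="edu.cmu.sphinx.linguist.flat.FlatLinguist">
        <property name="grammar"       value="jsgfGrammar"/>
        <property name="acousticModel" value="wsj"/>
    </component>

For open transcription of general speech, though, the n-gram route is probably the one to pursue rather than a fixed grammar.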
Upvotes: 0
Reputation: 25220
There is a specific project dedicated to speech-to-text alignment; this is not a trivial task. Development happens in a separate sphinx4 branch. You can find some details here:
http://cmusphinx.sourceforge.net/?s=long+audio+alignment
If you have any questions about this project, you are welcome to ask on the sphinx4 forum:
http://sourceforge.net/projects/cmusphinx/forums/forum/382337
Upvotes: 2