Reputation: 11746
I'm new to speech recognition, and I'm looking for a way to split a sentence (or several sentences), given as audio/WAV files, into individual words. This sounds like a standard problem, so I'm wondering how people in the industry approach it.
PS: Yes, this question was asked three years ago, but I'm looking for an up-to-date answer using newer libraries (e.g. PyTorch and TensorFlow 2.0). Thanks!
Upvotes: 2
Views: 1340
Reputation: 68160
This is not so trivial.
What you want is called an alignment, i.e. a mapping in which each audio frame is assigned to a word (or to a subword, a character, or, better, an individual phoneme).
The most reasonable approach needs a conventional speech recognition system. The easiest would be an HMM system, backed either by old-fashioned GMMs or by NNs (the latter is called a hybrid HMM-NN model). This also requires a pronunciation lexicon (a mapping from each word to its phoneme sequence). Usually you would use an existing implementation of all that, e.g. Kaldi or RASR, as this is not so simple to implement. I have not seen a pure TF implementation of that. This software then calculates the best possible alignment path through the HMM (i.e. the path with the highest probability according to the trained model). If you know the ground-truth words, this is a forced alignment, and the best path is computed with the Viterbi algorithm. Otherwise you have to do decoding (using beam search), which also has to find the words themselves.
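To make the Viterbi part concrete, here is a minimal sketch of a forced alignment in plain NumPy. It is not how Kaldi or RASR implement it. It assumes you already have a matrix `log_probs[t, s]` with the log-score of frame `t` under the `s`-th unit (word or phoneme) of the known transcript, that there are no transition probabilities, and that every unit must cover at least one frame. The names `forced_align` and `segments` are made up for this illustration.

```python
import numpy as np

def forced_align(log_probs):
    """Best monotonic frame-to-unit path for a known unit sequence.

    log_probs[t, s] is an assumed input: the log-score of frame t under
    the s-th unit of the transcript. Returns one unit index per frame.
    """
    T, S = log_probs.shape
    if T < S:
        raise ValueError("need at least one frame per unit")
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)  # 0 = stayed in s, 1 = came from s-1
    score[0, 0] = log_probs[0, 0]       # the path must start in the first unit
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            move = score[t - 1, s - 1] if s > 0 else -np.inf
            if move > stay:
                score[t, s] = move + log_probs[t, s]
                back[t, s] = 1
            else:
                score[t, s] = stay + log_probs[t, s]
    # Backtrace from the last unit at the last frame.
    s = S - 1
    path = [s]
    for t in range(T - 1, 0, -1):
        s -= int(back[t, s])
        path.append(s)
    path.reverse()
    return path

def segments(path, frame_shift=0.01):
    """Turn a per-frame path into (unit, start, end) tuples.

    frame_shift is the assumed time step between frames (10 ms here).
    """
    out = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            out.append((path[start], start * frame_shift, t * frame_shift))
            start = t
    return out
```

With a real system, `log_probs` would come from the trained acoustic model, and the units would be the HMM states you get by expanding the transcript through the lexicon. The word boundaries then fall where the path moves from the last state of one word to the first state of the next.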
What you can also do, although it is more hacky and gives worse results for this task of getting an alignment, is to use one of the end-to-end models, e.g. encoder-decoder with attention, or CTC. For encoder-decoder with attention, you can use the attention weights to get a good guess of where the words are (and from that you can maybe guess where the boundaries are). The same works for CTC, using the frames where the labels are emitted. This will not be accurate, but it is something you can implement easily in pure TF.
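As a rough illustration of the CTC variant, the sketch below takes per-frame label posteriors and reports the frame at which each label is first emitted. Plain NumPy stands in for the model output here. The function name `ctc_peak_positions`, the posterior matrix, and the choice of index 0 for the blank label are all assumptions for the example. Note that this only gives approximate positions (CTC outputs tend to be spiky), not real word boundaries.

```python
import numpy as np

def ctc_peak_positions(frame_posteriors, blank=0):
    """Greedy CTC decoding that keeps the emission frame of each label.

    frame_posteriors[t, v] is an assumed input: the model's score for
    label v at frame t. Returns a list of (label, first_frame) pairs.
    """
    best = frame_posteriors.argmax(axis=1)  # most likely label per frame
    emitted = []
    prev = blank
    for t, label in enumerate(best):
        label = int(label)
        # A label is emitted when it is not blank and not a repeat
        # of the label in the previous frame (standard CTC collapsing).
        if label != blank and label != prev:
            emitted.append((label, t))
        prev = label
    return emitted
```

If the labels are characters or subwords, you would still have to group them into words and then guess a boundary somewhere between the last emission of one word and the first emission of the next, which is where the inaccuracy comes from.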
In any case, the implementation itself is not the hardest part (although it is still not simple). You first need to understand the theory behind it, and StackOverflow is probably not the right place to ask about that. Read through the Kaldi or RASR documentation, watch some lectures on speech recognition, or read a book on the topic.
Upvotes: 3