Reputation: 51
Sphinx4 requires the audio used for the acoustic model to be segmented into pieces of 5-30 seconds each. Why? And how do you segment the audio? When should you segment it at 5 seconds, at 10 seconds, or at 25 seconds? Thank you, dear sir!
Upvotes: 2
Views: 112
Reputation: 25220
SphinxTrain aligns the text to the audio during training. It tries to match phonemes with individual pieces of audio. When an utterance is long, it is harder to get a good match because there are too many possible alignments and more room for mistakes. For that reason it is better to stay within the recommended utterance length.
When you segment the audio, you need to split on silence regions. The exact utterance length does not matter much; it is more important to have a short silence region at the beginning and at the end of each utterance. A short silence region helps the trainer find context.
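To illustrate splitting on silence, here is a minimal sketch in Python. It is not part of Sphinx4 or SphinxTrain; the function names (`frame_rms`, `split_on_silence`), the energy threshold, and the frame and silence durations are all assumptions chosen for illustration. It assumes the audio is already loaded as a list of float samples normalized to roughly [-1, 1]. It cuts in the middle of each sufficiently long silent run, so every resulting segment keeps a little silence at its start and end.

```python
import math


def frame_rms(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def split_on_silence(samples, rate, frame_ms=20, threshold=0.01,
                     min_silence_ms=300):
    """Return (start, end) sample ranges, cut at the middle of silent runs.

    All defaults are illustrative assumptions, not SphinxTrain settings.
    """
    frame_len = int(rate * frame_ms / 1000)
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    energies = frame_rms(samples, frame_len)

    cut_points = []
    run_start = None
    # A loud sentinel frame at the end closes any trailing silent run.
    for i, energy in enumerate(energies + [float("inf")]):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            run_len = i - run_start
            # Only interior runs produce cuts; leading and trailing
            # silence simply stays attached to the first/last segment.
            interior = run_start > 0 and i < len(energies)
            if run_len >= min_silent_frames and interior:
                cut_points.append((run_start + run_len // 2) * frame_len)
            run_start = None

    bounds = [0] + cut_points + [len(samples)]
    return list(zip(bounds[:-1], bounds[1:]))
```

You could then check the duration of each segment (`(end - start) / rate`) and merge or re-split any that fall far outside the recommended range, for example by adjusting `min_silence_ms`.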
Upvotes: 1