Allen Pol

Reputation: 51

Why do you need to segment the audio into pieces of 5-30 seconds each to build the acoustic model?

Sphinx4 requires the audio used to train the acoustic model to be segmented into utterances of 5-30 seconds each. Why? And how do you segment the audio? How do you decide whether to cut at 5 seconds, at 10 seconds, or at 25 seconds? Thank you, dear sir!

Upvotes: 2

Views: 112

Answers (2)

Nikolay Shmyrev

Reputation: 25220

SphinxTrain aligns the transcript to the audio during training: it tries to match phonemes to individual pieces of audio. When an utterance is long, a good match is harder to find because there are too many possible alignments and more room for error. That is why it is better to stay within the recommended utterance length.

When you segment the audio, split on silence regions. The exact utterance length does not matter much; it is more important to leave a short silence region at the beginning and at the end of each utterance. A short silence region helps the trainer find context.
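To illustrate splitting on silence, here is a rough Python sketch. It is not part of SphinxTrain; the function names, the energy threshold, and the frame counts are all assumptions made for the example. It measures per-frame energy, finds silent runs, and cuts in the middle of a silent run so that both neighbouring segments keep a bit of silence at their edges.

```python
def frame_energies(samples, frame_len):
    """Mean absolute amplitude of each non-overlapping frame."""
    return [
        sum(abs(s) for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def silence_midpoints(energies, threshold, min_silence_frames):
    """Middle frame index of every silent run that is long enough."""
    mids = []
    run_start = None
    # The infinite sentinel closes a silent run that reaches the end.
    for i, energy in enumerate(list(energies) + [float("inf")]):
        if energy < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_silence_frames:
                mids.append((run_start + i) // 2)
            run_start = None
    return mids


def choose_cuts(mids, min_len_frames):
    """Keep a cut only if it is at least min_len_frames after the last one."""
    cuts = []
    last = 0
    for mid in mids:
        if mid - last >= min_len_frames:
            cuts.append(mid)
            last = mid
    return cuts
```

With 10 ms frames, `min_len_frames=500` would correspond to the 5-second lower bound. This greedy choice does not enforce an upper bound, so a stretch with no pauses would still produce a long segment that needs a manual cut.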

Upvotes: 1

Mido

Reputation: 665

As a rule of thumb, the longer the segment, the better. To segment the audio, you might want to look at SoX. Its trim effect is handy for segmentation.
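As a small sketch of how trim could be driven from a script, the Python helper below builds one `sox` command per segment from a list of cut times in seconds. The helper name, the output file naming, and the example times are assumptions for illustration; the only SoX usage relied on is `sox INPUT OUTPUT trim START LENGTH`.

```python
def sox_trim_commands(src, cut_times, total_duration):
    """Build one 'sox ... trim START LENGTH' command per segment."""
    bounds = [0.0] + list(cut_times) + [total_duration]
    commands = []
    for n, (start, end) in enumerate(zip(bounds, bounds[1:]), start=1):
        out = f"segment_{n:03d}.wav"  # hypothetical naming scheme
        commands.append(
            ["sox", src, out, "trim", f"{start:.2f}", f"{end - start:.2f}"]
        )
    return commands


if __name__ == "__main__":
    # Print the commands; each list could be passed to subprocess.run.
    for cmd in sox_trim_commands("input.wav", [8.0, 20.5], 30.0):
        print(" ".join(cmd))
```

The cut times would normally come from silence detection or manual inspection rather than fixed intervals, so that no word is split in the middle.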

Upvotes: 0
