Reputation: 1227
I'm working on a simple TTS engine. It would be good to have an automatic diphone segmentation system that takes a recorded sound and a phoneme transcript (for a single utterance) and sets the phoneme boundaries in the sound. Can this be done with CMU Sphinx? Which version of Sphinx should I use?
Upvotes: 2
Views: 4454
Reputation: 25220
You can train a speaker-dependent acoustic model for your speaker with SphinxTrain. For details on training, see
http://cmusphinx.sourceforge.net/wiki/tutorialam
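Before training or aligning, you need a control file listing the utterance ids and a transcription file in the Sphinx format (one line per utterance, words wrapped in `<s> ... </s>` followed by the utterance id in parentheses). A minimal sketch of generating both, assuming hypothetical `wav/` and `prompts/` directories where each `<uttid>.wav` has a matching plain-text prompt `<uttid>.txt`:

```shell
#!/bin/sh
# Sketch: generate the -ctl (db.fileids) and -insent (db.transcription)
# files that SphinxTrain/sphinx3_align expect.
# The wav/ and prompts/ directory names are assumptions for this example.
make_db() {
    : > db.fileids
    : > db.transcription
    for f in wav/*.wav; do
        [ -e "$f" ] || continue              # skip if no recordings yet
        uttid=$(basename "$f" .wav)
        echo "$uttid" >> db.fileids
        # one line per utterance: <s> WORDS </s> (uttid)
        echo "<s> $(cat "prompts/$uttid.txt") </s> ($uttid)" >> db.transcription
    done
}
make_db
```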
To segment the database, you can use the sphinx3_align binary like this:
sphinx3_align \
-hmm <model_dir> \
-dict dictionary.dic \
-ctl db.fileids \
-cepdir <feats_folder> \
-cepext .mfc \
-insent db.transcription \
-outsent db.out \
-phlabdir phlabdir
The phone-level alignments will be written to the folder called phlabdir.
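If you pass `-phsegdir` instead, sphinx3_align writes one segmentation file per utterance with a `SFrm EFrm SegAScr Phone` column layout, where times are in frames. A small sketch converting frames to seconds, assuming that column layout and the default frame rate of 100 frames per second (the file path is hypothetical; adjust to your own output):

```shell
#!/bin/sh
# Convert a sphinx3_align phone segmentation to "start end phone" in seconds.
# Assumes the "SFrm EFrm SegAScr Phone" layout and 100 frames/second.
PHSEG=phsegdir/utt1.phseg
if [ -f "$PHSEG" ]; then
    # skip the header line, keep only well-formed rows
    awk 'NR > 1 && NF >= 4 { printf "%.2f %.2f %s\n", $1/100, $2/100, $4 }' "$PHSEG"
fi
```

These start/end times are what you would feed into your diphone cutter.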
Upvotes: 2