Reputation:
this is my first time posting a question here so if the approach is not so standard i apologize, i understand there are lots of questions out there on this and i have read tons of thesis, questions, aritcles and tutorials yet i seem to have a problem and it's always best to ask. i am creating a speech recognition application, using phoneme level processing(not isolated word) continuous HMMs based on gaussian mixture models, involving baum welch, forward-backward, and viterbi algorithms, i have implemented a very good feature extraction and pre-processing method (MFCC), feature vectors consist of the mfcc, delta and acceleration coefficients as well and it's working pretty well on it's part however when it comes to HMMs , i seem to either have a 'Major Misunderstading' about how HMMs are supposed to help recognize speech or i am missing a little point here...i have try harded a lot that at this point i can't really tell what's wrong and right.
first off, i recorded around 50 words, each 6 utterances, and run them through a correct compatibility and conversion program that i wrote myself and the extracted the features so that they can be used for baum-welch.
i want you to please tell me where am i making a mistake in this procedure, also i will mention a few doubts i have on it so that you can help me understand this whole subject better.
here are the steps in my application concerning anything related to the training : steps for initial parameters of HMM model :
1 - assign all observations from each training sample of each model to their corresponding discrete state(or in other words, which feature vector belongs to which alphabetic state).
2 - use k-means to find the initial continuous emission parameters, clustering is done over all observations of each state, here the cluster size is 6 (equal to number of mixtures for probability density function), parameters would be sample means, sample covariances and some mixture weights for each cluster.
3 - create initial state-initial and transition probability matrices for each model and training sample individually(left right structure is used in this case), 0 for previous states and 1 for up to 1 next state for transitions and 1 for initial and 0 for others in state initials.
4 - calculate gaussian mixture model based probability density function for each state -> it's corresponding cluster -> assigned to all the vectors in all the training samples for each model
5 - calculate initial emission parameters using the pdf and mixture weights for clusters.
6 - now calculate the gamma variables using initial paramters(transitions, emissions, initials) in forward-backward and initial PDFs, using the continuous formula for gamma..(gamma = probability of being in a certain state at a certain time for any of the mixtures)
7 - estimate new state initials
8 - estimate new state tranisitons
9 - estimate new sample means
10 - estimate new sample covariances
11 - estimate new pdfs
12 - estimate new emissions using new pdfs
repeat the steps from 6 to 12 using new estimated values on each iteration, use viterbi to get an overlook on how the estimating is going and when the probability is not changing anymore, stop and save.
now my issues : first i don't know if the entire procedure i have followed is correct or not, or is there a better method to approach this...for all i know is that the convergence is pretty fast, for up to 4-5 iterations and it's already not changing anymore, however considering that if i am right then : it's not possible for me to sit down and pre assign each feature vector to it's state in the beginning at step 1...and i don't think it's a standard procedure either...again i don't even know if i have to do it necessarily, from all my studies it was the best method i could find to get a rapid convergence.
second, say this whole baum welch has done a great job in re estimating and finding local maximums, what's raising my doubt about my baum welch implementation is that how are they later going to help me recognize speech? i assume the estimated parameters are used in viterbi for finding the optimal state for every spoken utterance...if so then emission parameters are not known cause if you look closely you will see that final emission parameters in my algorithm will be assigning each alphabetic state of each model to all the observed signals in each model, other than that...no emission parameters can be found if the signal is not exactly match to the ones used in re-estimation, and it won't obviously work, any attempt to try and match out the signals and find emissions will make the whole HMM lose it's purpose...
again i might have a wrong idea about almost everything here, i would appreciate if you help me understand what i am doing wrong here...if ANYTHING is wrong, notify me please...thank you.
Upvotes: 1
Views: 713
Reputation: 7511
You're attempting to determine the most likely set of phonemes that would have generated the sounds that you're observing - you're not attempting to work out emission parameters, you're working out the most likely set of inputs that would have produced them.
Also, your input corpus is quite small - it's unsurprising that it would converge so quickly. If you're doing this while involved with a university, see if they have access to one of the larger speech corpuses commonly used to train this kind of algorithm on.
Upvotes: 1