Reputation: 1413
I am trying to develop a system for audio classification in Java using MFCC features and hidden Markov models. I am following this research paper: http://acccn.net/cr569/Rstuff/keys/bathSoundMonitoring.pdf .
It describes the algorithm as follows:
Each sound file, corresponding to a sample of a sound event, was processed in frames pre-emphasized and windowed by a Hamming window (25 ms) with an overlap of 50%. A feature vector consisting of a 13-order MFCC characterized each frame. We modeled each sound using a left-to-right six-state continuous-density HMM without state skipping. Each HMM state was composed of two Gaussian mixture components. After a model initialization stage was done, all the HMM models were trained in three iterative cycles.
I already have the first part working, which is the feature extraction from a sample sound. As a result I get a 2D array of doubles with 13 columns per row (each row represents one frame of the sound). My problem now is how to train the HMM using this data.
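For reference, the framing described in the paper (25 ms Hamming window, 50% overlap) can be sketched as below. This is only a rough illustration, not the paper's code: the 16 kHz sample rate, the class name `Framing` and its method names are assumptions made for the example, and dropping the incomplete trailing frame is one possible convention among several.

```java
public final class Framing {
    /** Number of samples in one frame for a window length given in milliseconds. */
    public static int frameLength(int sampleRateHz, double windowMs) {
        return (int) Math.round(sampleRateHz * windowMs / 1000.0);
    }

    /** Number of complete frames; trailing samples that do not fill a frame are dropped. */
    public static int frameCount(int numSamples, int frameLength, int hop) {
        if (numSamples < frameLength) {
            return 0;
        }
        return 1 + (numSamples - frameLength) / hop;
    }

    /** Hamming window coefficients w[i] = 0.54 - 0.46 * cos(2 * pi * i / (n - 1)). */
    public static double[] hamming(int n) {
        double[] w = new double[n];
        if (n == 1) {
            w[0] = 1.0; // degenerate single-sample window
            return w;
        }
        for (int i = 0; i < n; i++) {
            w[i] = 0.54 - 0.46 * Math.cos(2.0 * Math.PI * i / (n - 1));
        }
        return w;
    }

    public static void main(String[] args) {
        int sampleRateHz = 16000;                  // assumed sample rate, not from the paper
        int len = frameLength(sampleRateHz, 25.0); // 25 ms window
        int hop = len / 2;                         // 50% overlap
        int frames = frameCount(sampleRateHz, len, hop); // one second of audio
        System.out.println(frames + " frames, each described by 13 MFCC values");
    }
}
```

Under these assumptions one second of audio gives 79 rows in the MFCC array, so every sound file produces a matrix with a different number of rows but always 13 columns.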
I am using the jahmm library. So far I have written some sample code to get a general understanding of how the library works.
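import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import be.ac.ulg.montefiore.run.jahmm.*;
import be.ac.ulg.montefiore.run.jahmm.io.*;
import be.ac.ulg.montefiore.run.jahmm.learn.*;

/** Some sample data standing in for the MFCC data. Here each line, terminated by a newline,
* is one observation. I don't know whether each line should be one row of the MFCC values
* (representing one frame) or whether each line should represent the set of features from one audio file.
*/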
String realSequences = "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n"
+ "1.1;2.2;3.3;4.4;5.5;6.6;7.7;8.8;9.9;10.0;11.1;12.2;13.3;\n";
/**
* This reader parses the data and puts it in the relevant collection format.
*/
Reader reader = new StringReader(realSequences);
List<? extends List<ObservationReal>> sequences =
ObservationSequencesReader.readSequences(new ObservationRealReader(), reader);
reader.close();
/**
* The paper states that each state is composed of two Gaussian mixture components.
*/
OpdfGaussianMixtureFactory gMixtureFactory = new OpdfGaussianMixtureFactory(2);
/**
* The jahmm manual says that the k-means learner is a good way to initialize the HMM. It has 6 states
* and uses the two-component Gaussian mixture factory created above.
*/
KMeansLearner<ObservationReal> kml = new KMeansLearner<ObservationReal>(6, gMixtureFactory, sequences);
Hmm<ObservationReal> initHmm = kml.iterate();
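/*
* As the paper states, the HMM is trained in 3 iterative cycles.
* Each Baum-Welch iteration must start from the result of the previous one;
* iterating on initHmm every time would just repeat the first cycle.
*/
BaumWelchLearner bwl = new BaumWelchLearner();
Hmm<ObservationReal> learntHmm = initHmm;
for (int i = 0; i < 3; i++) {
learntHmm = bwl.iterate(learntHmm, sequences);
}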
My questions are:
Q1: In what format should the MFCC data be passed to train the HMM? (See the comment above the realSequences line.)
Q2: In speech recognition we sometimes need to train the system by repeating the same word, let's say 10 times. Does that mean it trains one HMM with those 10 samples? If yes, how do I train one HMM with different samples of the same sound? Or is it 10 separately trained HMMs, each labeled with that word?
Q3: How do I compare two HMM models in terms of sound recognition? Is it better to use Viterbi or the Kullback-Leibler distance?
Reputation: 25220
Q1: In what format should the MFCC data be passed to train the HMM? (See the comment above the realSequences line.)
The MFCC data must be represented as:
List<? extends List<ObservationVector>> sequences
It's a list of data sequences. Each sequence corresponds to one word sample and is a list of vectors; each vector represents one frame and contains 13 MFCC values.
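As a rough sketch of that layout, the conversion from per-file MFCC matrices to a list of sequences could look like this. It uses plain `double[]` frames so it runs without jahmm; the class name `MfccSequences` and its methods are made up for the example. With jahmm, each frame would be wrapped in an `ObservationVector` (I believe it has a constructor taking a `double[]`, but check that against the version you use).

```java
import java.util.ArrayList;
import java.util.List;

public final class MfccSequences {
    static final int NUM_COEFFS = 13; // MFCC values per frame, as in the question

    /** Turns one file's MFCC matrix (frames x 13) into one sequence of frame vectors. */
    public static List<double[]> toSequence(double[][] mfcc) {
        List<double[]> sequence = new ArrayList<>();
        for (double[] frame : mfcc) {
            if (frame.length != NUM_COEFFS) {
                throw new IllegalArgumentException(
                        "expected " + NUM_COEFFS + " coefficients, got " + frame.length);
            }
            // One observation per frame; with jahmm this is where the frame
            // would be wrapped in an ObservationVector (assumption, see above).
            sequence.add(frame.clone());
        }
        return sequence;
    }

    /** One sequence per recording: the outer list holds all samples of the same sound. */
    public static List<List<double[]>> toSequences(List<double[][]> recordings) {
        List<List<double[]>> sequences = new ArrayList<>();
        for (double[][] mfcc : recordings) {
            sequences.add(toSequence(mfcc));
        }
        return sequences;
    }
}
```

So a row of the MFCC array is one observation, a whole file is one sequence, and the outer list collects the files. Sequences may have different lengths.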
Q2: In speech recognition we sometimes need to train the system by repeating the same word, let's say 10 times. Does that mean it trains one HMM with those 10 samples?
The input data for each word is a list of sequences. The whole list is processed together.
If yes, how do I train one HMM with different samples of the same sound? Or is it 10 separately trained HMMs, each labeled with that word?
It's one HMM. The HMM training algorithm works with several samples of each word. It actually needs quite a lot of samples, more than 10.
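In practice that means collecting all recordings of the same word into one training set before learning. A minimal sketch of that grouping step, assuming each recording is already a list of 13-value frames; the `TrainingSets` and `Sample` names are hypothetical and the actual HMM training call is left out:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class TrainingSets {
    /** One labelled recording: the word (or sound) label and its MFCC frames. */
    public static final class Sample {
        final String label;
        final List<double[]> frames;

        public Sample(String label, List<double[]> frames) {
            this.label = label;
            this.frames = frames;
        }
    }

    /**
     * Groups recordings by label. Each map value is the complete list of
     * sequences that would be handed to the learner to train ONE HMM for that label.
     */
    public static Map<String, List<List<double[]>>> groupByLabel(List<Sample> samples) {
        Map<String, List<List<double[]>>> byLabel = new LinkedHashMap<>();
        for (Sample s : samples) {
            byLabel.computeIfAbsent(s.label, k -> new ArrayList<>()).add(s.frames);
        }
        return byLabel;
    }
}
```

Ten recordings of one word then end up as ten sequences under a single key, and that one list trains a single model.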
Q3: How do I compare two HMM models in terms of sound recognition? Is it better to use Viterbi or the Kullback-Leibler distance?
It's not quite clear what you mean by "compare" here. Do you want one HMM to have fewer states than the other, or something else? Which property do you want to compare? The answer depends on that.
It's also important to note that HMM training for speech recognition has some specifics (how to select the number of states, which features to use, how to initialize the HMM). For that reason, for best performance it's better to use a specialized toolkit like CMUSphinx (http://cmusphinx.sourceforge.net) rather than a generic one.