Reputation: 4276
I spent the whole of last week reading up on MFCC and related issues. Now I can extract MFCC features from a .wav file into a 2-dimensional array, say coff[56][12], where 12 is the number of coefficients I want to extract and 56 is the number of frames. According to several documents I read, these 12 coefficients can be used to recognize speech (in particular, I want to recognize the words "one", "two", ... up to "ten"). But now I have 56 sets of 12 coefficients, so which of the 56 frames should I use?
If I have misunderstood something, please correct me!
Upvotes: 0
Views: 3689
Reputation: 2507
You are skipping some crucial steps. Let me briefly explain how it should work. Speech data is initially a discrete signal. You cut it into pieces called "frames", small enough that each piece hopefully contains no more than a single phone. Frames often overlap so that no vital information is lost. Then you extract features (MFCCs) from every frame and, using a Hidden Markov Model, search for the most probable word spanning a sequence of frames. So you do not pick one of the 56 frames; the whole sequence is the input. At this stage you also need a pronunciation dictionary and an acoustic model. On the next level you use a language model, which describes how words can be combined into sentences, to get the final hypothesis. This is an extremely abstract description, so you will need to study each decoding step more closely.
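To make the framing step concrete, here is a minimal sketch in plain Python. The function name `split_into_frames`, the 16 kHz sample rate, the 25 ms frame length, and the 10 ms hop are all assumptions chosen for illustration; they are not taken from the question, and the silent placeholder signal stands in for real samples read from a .wav file.

```python
def split_into_frames(samples, frame_len, hop_len):
    """Cut a sequence of samples into overlapping frames.

    Consecutive frames start hop_len samples apart, so they overlap by
    frame_len - hop_len samples. Trailing samples that do not fill a
    whole frame are dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


# Assumed (hypothetical) parameters: 16 kHz audio, 25 ms frames, 10 ms hop.
sample_rate = 16000
frame_len = sample_rate * 25 // 1000  # 400 samples per frame
hop_len = sample_rate * 10 // 1000    # 160 samples between frame starts

# Placeholder signal; real code would load samples from the .wav file.
signal = [0.0] * 9200

frames = split_into_frames(signal, frame_len, hop_len)
print(len(frames), len(frames[0]))  # prints "56 400"
```

With these assumed numbers, a signal of 9200 samples yields 56 frames, and each frame would then produce its own vector of 12 MFCCs. The recognizer consumes all of those vectors in order rather than a single one.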
Upvotes: 9