Reputation: 2484
I understand the basic steps of creating an automated speech recognition engine. However, I need a clearer idea of how segmentation is done and what frames and samples are. I will write down what I know and expect the answerer to correct me where I'm wrong and guide me further.
The basic steps of Speech Recognition as I know it are:
(I'm assuming the input data is a wav/ogg (or some kind of audio) file)
Although these are clear to me, I am not sure whether step 3 is correct. If it is correct, in the steps following 3, do I apply that to each frame? Also, after step 6, I think that each frame has its own set of MFCCs, am I right?
Thank you in advance!
Upvotes: 7
Views: 3302
Reputation: 25220
Segment the clip into smaller time frames, each segment being about 30 ms long. Further, each segment will have about 256 frames, and two segments will have a separation of 100 frames? (i.e., 30*100/256 ms?)
Not frames, but samples. Each 30 ms frame at an 8 kHz sample rate contains 30/1000 * 8000 = 240 samples. Frames overlap, and the shift between consecutive frames is 10 ms, or 80 samples. Here is how it looks in a picture:
In the picture, Q (the shift between frames) is 80 samples and K (the frame length) is 240 samples.
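To make the arithmetic concrete, here is a minimal sketch of the framing step in plain Python. The function name `split_into_frames` and the dummy one-second signal are made up for illustration. Dropping trailing samples that do not fill a whole frame is just one possible convention (padding the last frame is another).

```python
def split_into_frames(samples, frame_len, frame_shift):
    """Cut a list of samples into overlapping frames.

    Hypothetical helper: trailing samples that do not fill a
    complete frame are dropped rather than zero-padded.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


sample_rate = 8000                      # 8 kHz, as in the answer
frame_len = 30 * sample_rate // 1000    # K: 30 ms -> 240 samples
frame_shift = 10 * sample_rate // 1000  # Q: 10 ms -> 80 samples

# Dummy signal: one second of audio where each sample is just its own index.
samples = list(range(sample_rate))

frames = split_into_frames(samples, frame_len, frame_shift)

print(frame_len, frame_shift)  # 240 80
print(len(frames))             # 98 frames in one second of audio
print(frames[1][0])            # 80: the second frame starts one shift later
```

Each entry of `frames` is what the later steps (windowing, FFT, and so on up to the MFCCs) would be applied to, one frame at a time.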
If it is correct, in the steps following 3, do I apply that to each frame?
Yes.
Also, after step 6, I think that each frame has its own set of MFCCs, am I right?
Yes.
Upvotes: 8