Reputation: 1664
I know what MFCC (Mel Frequency Cepstral Coefficients) is, but I need to understand what each value represents. Is it some sort of sound frequency value, or something else?
Let's assume we have this kind of matrix. Each row represents the coefficients of one frame, but what are those numbers? Is it maybe the highest frequency or something?
Upvotes: 4
Views: 3900
Reputation: 11
The cepstrum is typically derived by computing the Discrete Cosine Transform of the (symmetric) log power spectrum of a frame of speech; here, the log power spectrum [curve] is treated as a signal (https://en.wikipedia.org/wiki/Mel-frequency_cepstrum). The cepstral coefficients are therefore measures of similarity between a sequence/curve (the log power spectrum) and cosine waves of different 'frequencies'. They capture the rate at which the values of this sequence vary.
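To make the "dot product with cosine waves" idea concrete, here is a minimal sketch in plain Python. It assumes the log power spectrum has already been computed and sampled at evenly spaced bins on [0, Pi], and it uses an unnormalized DCT-II; the function name `cepstral_coefficients` and the sample values are made up for illustration. A real MFCC pipeline would typically apply this step to log mel filterbank energies and may scale the result differently.

```python
import math

def cepstral_coefficients(log_spectrum, num_coeffs):
    """Unnormalized DCT-II of a log power spectrum sampled on [0, Pi].

    Coefficient k is the dot product of the spectrum with a cosine
    that completes k half-periods across the [0, Pi] band.
    """
    n = len(log_spectrum)
    coeffs = []
    for k in range(num_coeffs):
        # Cosine template evaluated at the center of each frequency bin.
        template = [math.cos(math.pi * k * (i + 0.5) / n) for i in range(n)]
        coeffs.append(sum(s * t for s, t in zip(log_spectrum, template)))
    return coeffs

# Made-up log power spectrum of one frame (high energy at low frequency).
frame_log_spectrum = [5.0, 4.0, 3.5, 3.0, 2.0, 1.5, 1.0, 0.5]
coeffs = cepstral_coefficients(frame_log_spectrum, 3)
print(coeffs)
```

Each row of an MFCC matrix would then be one such list of coefficients, computed for one frame.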
The first cepstral coefficient is the dot product of the log power spectrum with the [periodic] cosine wave whose single period begins at the origin (f=0) in the frequency domain and ends at f=2*Pi radians (or, equivalently, the sampling frequency). An illustration: the log power spectrum of a vowel has high energy in the low-frequency region (near f=0) and low energy in the high-frequency region (near f=Pi). In other words, the log power spectrum curve of a vowel has a negative slope over the range [0,Pi]. Since this variation is similar to that of the cosine wave mentioned above, which also falls from its maximum at f=0 to its minimum at f=Pi, the first cepstral coefficient of a vowel frame will be positive. In contrast, cepstrum[1] of an unvoiced fricative such as /s/ will be negative, because its log power spectrum has a positive slope: low energy at low frequency and high energy at high frequency.
Similarly, cepstrum[2] would be positive if there is a major valley in the log power spectrum at f=Pi/2, since the corresponding cosine wave has its minimum there and its maxima at f=0 and f=Pi. The log power spectrum of a voiced fricative (e.g., /z/) comes close to this description: there is significant energy at high frequency due to the fricative nature of the sound, as well as significant energy at low frequency due to voicing. Of course, cepstrum[0] is proportional to the average of the log power spectral values; it captures the volume/loudness of the signal.
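The sign behaviour described above can be checked with a small sketch. The three spectra below are hypothetical toy curves (a falling line, a rising line, and a V-shaped valley at Pi/2), not measurements of real speech, and `cepstral_coeff` is an illustrative helper using an unnormalized DCT-II.

```python
import math

def cepstral_coeff(log_spectrum, k):
    """Dot product of the spectrum with the k-th DCT-II cosine."""
    n = len(log_spectrum)
    return sum(v * math.cos(math.pi * k * (i + 0.5) / n)
               for i, v in enumerate(log_spectrum))

n = 64
# Bin centers covering the band [0, Pi].
omega = [math.pi * (i + 0.5) / n for i in range(n)]

# Toy log power spectra (assumed shapes, for illustration only).
vowel_like = [10.0 - 2.0 * w for w in omega]                   # falling slope
s_like = [4.0 + 2.0 * w for w in omega]                        # rising slope
z_like = [6.0 + 3.0 * abs(w - math.pi / 2) for w in omega]     # valley at Pi/2

c1_vowel = cepstral_coeff(vowel_like, 1)   # expected sign: positive
c1_s = cepstral_coeff(s_like, 1)           # expected sign: negative
c2_z = cepstral_coeff(z_like, 2)           # expected sign: positive
c0_vowel = cepstral_coeff(vowel_like, 0)   # n times the mean log power

print(c1_vowel > 0, c1_s < 0, c2_z > 0)
print(c0_vowel / n, sum(vowel_like) / n)
```

With this unnormalized transform, cepstrum[0] is simply the sum of the log power values, so dividing by the number of bins gives their average.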
Upvotes: 1