Getting 96 MFCC features using python_speech_features

Question

I want to train my model using 96 MFCC features. I used Librosa and didn't get promising results. I then tried python_speech_features, but I can't get more than 26 features. Why? These are the shapes I get for the same audio file:

using Librosa

x = librosa.feature.mfcc(y=audio, sr=rate, n_mfcc=96)
x.shape  # (96, 204)

using python_speech_features

mfcc_feature = pySpeech.mfcc(audio, rate, 0.025, 0.01, 96, nfft=1200, appendEnergy = True)
mfcc_feature.shape # output => (471, 26)

Any thoughts?

SuperKogito · Accepted Answer

The implementations in librosa and python_speech_features differ from each other, both in structure and in the underlying theory. Based on the docs:

https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html (also https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html)
https://python-speech-features.readthedocs.io/en/latest/#python_speech_features.base.mfcc

You will notice that the output layouts differ: librosa's mfcc returns shape (n_mfcc, t), whereas python_speech_features returns (num_frames, numcep), so you need to transpose one of the two. You will also notice that raising numcep above 26 in python_speech_features changes nothing in the returned MFCCs. That is because the number of coefficients is limited by the number of mel filters, nfilt, which defaults to 26, so you have to increase nfilt as well. Moreover, make sure both calls use equivalent framing: librosa takes window and hop sizes in samples, while python_speech_features takes them as durations in seconds, so convert one to the other. Finally, python_speech_features accepts the int16 values returned by scipy's read function, but librosa requires floating-point input, so either convert the array or load the file with librosa.load(). Here is a small snippet that applies these changes:

import librosa
import numpy as np
import python_speech_features
from scipy.io.wavfile import read


# init fname
fname = "sample.wav"

# read audio 
rate, audio = read(fname)

# using librosa (expects float input; window and hop sizes are given in samples)
librosa_mfcc_feature = librosa.feature.mfcc(y=audio.astype(np.float32),
                                            sr=rate,
                                            n_mfcc=96,
                                            n_fft=1024,
                                            win_length=int(0.025*rate),
                                            hop_length=int(0.01*rate))
# transpose to (num_frames, n_mfcc) to match python_speech_features
print(librosa_mfcc_feature.T.shape)

# using python_speech_features (window and step sizes are given in seconds)
psf_mfcc_feature = python_speech_features.mfcc(signal=audio,
                                               samplerate=rate,
                                               winlen=0.025,
                                               winstep=0.01,
                                               numcep=96,
                                               nfilt=96,
                                               nfft=1024,
                                               appendEnergy=False)
print(psf_mfcc_feature.shape)


# check if size is the same (compare the transposed librosa output)
print(librosa_mfcc_feature.T.shape == psf_mfcc_feature.shape)

I tested this and the output is the following:

(9003, 96)
(9001, 96)
False
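
The two shapes differ only in the number of frames. A rough sketch of how each library appears to count frames is below. It assumes librosa's default centered framing (the signal is padded on both sides) and python_speech_features padding the last partial frame. The sample rate and signal length are hypothetical values, chosen only because they reproduce the two counts above; they are not taken from the actual file.

```python
import math


def psf_num_frames(n_samples, rate, winlen=0.025, winstep=0.01):
    """Approximate frame count for python_speech_features (assumed behaviour)."""
    # The library rounds half up; Python's round() agrees for these values.
    frame_len = int(round(winlen * rate))
    frame_step = int(round(winstep * rate))
    if n_samples <= frame_len:
        return 1
    # The final partial frame is zero-padded, hence the ceil.
    return 1 + math.ceil((n_samples - frame_len) / frame_step)


def librosa_num_frames(n_samples, hop_length):
    """Approximate frame count for librosa with its default center=True."""
    # Centered framing pads the signal, so the window length drops out.
    return 1 + n_samples // hop_length


rate = 16000           # hypothetical sample rate
n_samples = 1_440_320  # hypothetical length (about 90 s at 16 kHz)

print(librosa_num_frames(n_samples, hop_length=160))  # 9003
print(psf_num_frames(n_samples, rate))                # 9001
```

Under these assumptions the gap is roughly one window length expressed in hops (400 / 160 = 2.5), which is where a difference of two or three frames comes from.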

The shapes are not identical, but they differ by only two frames. By the way, the values will not match either, because each library computes the MFCCs differently: python_speech_features takes a discrete Fourier transform of frames it cuts itself, whereas librosa goes through its short-time Fourier transform, and the two also differ in defaults such as the window function, pre-emphasis and log scaling.
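
As for the 26-coefficient cap in the question: to my understanding, python_speech_features takes a DCT over the nfilt log filterbank energies of each frame and then keeps the first numcep coefficients. The sketch below imitates that step in plain Python. The helper names and the energy values are made up for illustration, and the DCT is left unnormalised for brevity, so this is not the library's own code.

```python
import math


def dct_ii(values):
    """Unnormalised DCT-II: N inputs always yield exactly N coefficients."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
            for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstra_for_frame(log_filterbank_energies, numcep):
    # Keep the first `numcep` coefficients. Slicing past the end of the
    # list is silently clipped, so numcep can never exceed the filter count.
    return dct_ii(log_filterbank_energies)[:numcep]


# Made-up log energies for one frame, with 26 and with 96 mel filters.
frame_26 = [math.log(1.0 + i) for i in range(26)]
frame_96 = [math.log(1.0 + i) for i in range(96)]

print(len(cepstra_for_frame(frame_26, numcep=96)))  # 26
print(len(cepstra_for_frame(frame_96, numcep=96)))  # 96
```

This is why asking for numcep=96 with the default nfilt=26 still gives 26 columns, and why setting nfilt=96 as well gives all 96.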
