Reputation: 125
I want to train my model using 96 MFCC features. I used librosa and didn't get a promising result. I then tried python_speech_features, but I can't get more than 26 features. Why? These are the shapes I get for the same audio file:
using Librosa
x = librosa.feature.mfcc(y=audio, sr=rate, n_mfcc=96)
x.shape # (96, 204)
using python_speech_features
mfcc_feature = pySpeech.mfcc(audio, rate, 0.025, 0.01, 96, nfft=1200, appendEnergy = True)
mfcc_feature.shape # output => (471, 26)
Any thoughts?
Upvotes: 3
Views: 2009
Reputation: 2966
The implementations in librosa and python_speech_features differ from each other, both in structure and in the details of the computation. Based on the docs, you will notice that the output layouts differ: librosa's mfcc returns shape (n_mfcc, t), whereas python_speech_features returns (num_frames, numcep), so you need to transpose one of the two.

You will also notice that any numcep value above 26 in python_speech_features won't change the number of coefficients returned. That is because you are limited by the number of mel filters used (nfilt, 26 by default), so you have to increase nfilt as well.

Moreover, you need to make sure the framing uses matching values: librosa takes the window and hop as sample counts, while python_speech_features takes them as durations in seconds, so you have to convert one to the other.

Finally, python_speech_features accepts the int16 values returned by scipy's read function, but librosa requires floating-point audio, so you have to convert the array or load the file with librosa.load().

Here is a small snippet that includes the previous changes:
import librosa
import numpy as np
import python_speech_features
from scipy.io.wavfile import read

# init fname
fname = "sample.wav"

# read audio (scipy returns integer samples for a PCM WAV file)
rate, audio = read(fname)

# using librosa: needs floating-point samples, and window/hop given in samples
# (n_fft must be at least win_length, i.e. at least 0.025 * rate)
librosa_mfcc_feature = librosa.feature.mfcc(y=audio.astype(np.float32),
                                            sr=rate,
                                            n_mfcc=96,
                                            n_fft=1024,
                                            win_length=int(0.025*rate),
                                            hop_length=int(0.01*rate))
print(librosa_mfcc_feature.T.shape)

# using python_speech_features: window/step given in seconds,
# and nfilt raised so that numcep=96 is not capped at 26
psf_mfcc_feature = python_speech_features.mfcc(signal=audio,
                                               samplerate=rate,
                                               winlen=0.025,
                                               winstep=0.01,
                                               numcep=96,
                                               nfilt=96,
                                               nfft=1024,
                                               appendEnergy=False)
print(psf_mfcc_feature.shape)

# check if size is the same (transpose librosa's output to (frames, coefficients))
print(librosa_mfcc_feature.T.shape == psf_mfcc_feature.shape)
I tested this and the output is the following:
(9003, 96)
(9001, 96)
False
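The gap between the two frame counts can be reasoned about with a little arithmetic. The sketch below is not code from either library; the function names are made up for illustration. It assumes librosa's default center=True padding with an even n_fft, and python_speech_features' rule of zero-padding the last partial frame. The 1-second, 16 kHz signal is a hypothetical input, not the file used above.

```python
import math


def librosa_num_frames(n_samples, hop_length):
    # Assumption: center=True (librosa default) and an even n_fft, so the
    # signal is padded by n_fft // 2 on each side before framing.
    return 1 + n_samples // hop_length


def psf_num_frames(n_samples, rate, winlen=0.025, winstep=0.01):
    # Assumption: window and step are rounded half-up to whole samples and
    # the last partial frame is zero-padded rather than dropped.
    frame_len = int(math.floor(winlen * rate + 0.5))
    frame_step = int(math.floor(winstep * rate + 0.5))
    if n_samples <= frame_len:
        return 1
    return 1 + int(math.ceil((n_samples - frame_len) / frame_step))


# Hypothetical input: 1 second of audio at 16 kHz, 10 ms hop (160 samples)
rate = 16000
n_samples = 16000
print(librosa_num_frames(n_samples, int(0.01 * rate)))  # 101
print(psf_num_frames(n_samples, rate))                  # 99
```

Passing center=False to librosa.feature.mfcc removes the padding, which should bring the two counts closer, though the last partial frame is still handled differently.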
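So the output is not exactly the same, but it is only a 2-frame difference, which comes from how each library pads and frames the signal. By the way, the values won't be the same either, because each library computes the MFCCs differently. Both take an FFT of each windowed frame (which is what a short-time Fourier transform is), but by default python_speech_features applies pre-emphasis and liftering and takes the natural log of the filterbank energies, whereas librosa applies a Hann window, converts the mel spectrogram to decibels, and builds its mel filterbank differently.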
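Coming back to the 26-coefficient cap from the question: in python_speech_features the cepstra are the DCT of the log filterbank energies, truncated to the first numcep columns. Since there are only nfilt columns to begin with, asking for more has no effect. Here is a minimal pure-Python sketch of that truncation. The helper names and the toy input are hypothetical, and this is a simplified stand-in, not the library's code.

```python
import math


def dct_ii_ortho(row):
    # Orthonormal DCT-II of one frame; the output has the same length as the input.
    n = len(row)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * k * (2 * m + 1) / (2 * n))
                for m, x in enumerate(row))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def cepstra_from_log_fbank(log_fbank, numcep):
    # Keep the first numcep DCT coefficients of every frame.
    # Slicing past the end of a list is harmless, so numcep > nfilt
    # simply returns all nfilt coefficients.
    return [dct_ii_ortho(frame)[:numcep] for frame in log_fbank]


# Toy "log filterbank energies": 3 frames with 26 and with 96 filters
fbank_26 = [[1.0 + i + j for j in range(26)] for i in range(3)]
fbank_96 = [[1.0 + i + j for j in range(96)] for i in range(3)]

print(len(cepstra_from_log_fbank(fbank_26, numcep=96)[0]))  # 26
print(len(cepstra_from_log_fbank(fbank_96, numcep=96)[0]))  # 96
```

This is why the snippet above sets nfilt=96 together with numcep=96.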
Upvotes: 2