Reputation: 21
I'm trying to create a speech emotion recognition model using Keras. I've written all of the code and trained the model. It sits around 50% validation accuracy and is overfitting.
When I use model.predict() on unseen data, it seems to have a hard time distinguishing between 'neutral', 'calm', 'happy' and 'surprised', but it predicts 'angry' correctly in the majority of cases - I assume because there's a clear difference in pitch or something similar.
I'm thinking it could be that I'm not extracting enough features from these emotions to help the model distinguish between them.
Currently I am using Librosa and converting the audio to MFCCs. Is there any other way, even within Librosa, to extract features that would help the model better distinguish between 'neutral', 'calm', 'happy', 'surprised', etc.?
Some feature extraction code:
import librosa

wav_clip, sample_rate = librosa.load(file_path, duration=3, mono=True, sr=None)
mfcc = librosa.feature.mfcc(y=wav_clip, sr=sample_rate)
Also, this is with a dataset of 1,400 samples.
Upvotes: 1
Views: 535
Reputation: 11407
A few observations for starters:

librosa gives you, AFAIK, an offset. I'd recommend plotting how your features correlate with the labels and how far they overlap; some classes can easily be confused, I guess. Find out whether there are any features that would differentiate your classes. Don't do this by running your model - do a visual inspection first.

On to the actual features! You're right to assume that pitch should play a vital role. I'd recommend checking out aubio - it has Python bindings.
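As a rough sketch of both points - a pitch-based feature plus a visual check of class overlap - here is one way to do it with librosa's pyin (requires librosa >= 0.8; aubio's pitch module would work just as well). The dataset variable, a list of (file_path, emotion) pairs, is an assumption for illustration:

import librosa
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rows = []
for file_path, emotion in dataset:  # dataset: assumed list of (path, label) pairs
    y, sr = librosa.load(file_path, duration=3, mono=True, sr=None)
    # pyin returns an F0 track with NaNs for unvoiced frames
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz('C2'), fmax=librosa.note_to_hz('C7'), sr=sr)
    rows.append({'emotion': emotion, 'median_f0': np.nanmedian(f0)})

df = pd.DataFrame(rows)
# One box per class: heavily overlapping boxes suggest this feature
# won't separate those emotions on its own
df.boxplot(column='median_f0', by='emotion')
plt.ylabel('median F0 [Hz]')
plt.show()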
Yaafe also offers an excellent selection of features.
You might easily end up with 150+ features, so you might want to reduce the dimensionality of the problem - perhaps even compress it to 2D and see if you can somehow separate the classes. Here is my own example with Dash.
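As a minimal sketch of that compression step, assuming X is an (n_samples, n_features) array of extracted features and labels holds the emotion classes (both names hypothetical), t-SNE from scikit-learn can project the data to 2D:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE

X_scaled = StandardScaler().fit_transform(X)  # put features on a common scale
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X_scaled)

labels = np.asarray(labels)
for emotion in np.unique(labels):
    mask = labels == emotion
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], label=emotion, s=10)
plt.legend()
plt.show()

If the classes already form visible clusters in 2D, the features carry enough signal; if 'neutral', 'calm' and 'happy' collapse into one blob, you likely need more discriminative features.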
Last but not least, here is some basic code to extract frequencies from the audio. In this case I am also trying to find the three peak frequencies.
import numpy as np
from scipy import signal


def spectral_statistics(y: np.ndarray, fs: int, lowcut: int = 0) -> dict:
    """
    Compute selected statistical properties of the spectrum.

    :param y: 1-d signal
    :param fs: sampling frequency [Hz]
    :param lowcut: lowest frequency [Hz] to keep
    :return: spectral features (dict)
    """
    # Magnitude spectrum and the matching frequency bins
    spec = np.abs(np.fft.rfft(y))
    freq = np.fft.rfftfreq(len(y), d=1 / fs)

    # Drop everything below the lowcut frequency
    idx = int(lowcut / fs * len(freq) * 2)
    spec = np.abs(spec[idx:])
    freq = freq[idx:]

    # Normalise so the amplitudes behave like a probability distribution
    amp = spec / spec.sum()
    mean = (freq * amp).sum()
    sd = np.sqrt(np.sum(amp * ((freq - mean) ** 2)))

    # Distribution statistics over the spectrum
    amp_cumsum = np.cumsum(amp)
    median = freq[len(amp_cumsum[amp_cumsum <= 0.5]) + 1]
    mode = freq[amp.argmax()]
    Q25 = freq[len(amp_cumsum[amp_cumsum <= 0.25]) + 1]
    Q75 = freq[len(amp_cumsum[amp_cumsum <= 0.75]) + 1]
    IQR = Q75 - Q25
    z = amp - amp.mean()
    w = amp.std()
    skew = ((z ** 3).sum() / (len(spec) - 1)) / w ** 3
    kurt = ((z ** 4).sum() / (len(spec) - 1)) / w ** 4

    top_peaks_ordered_by_power = {'stat_freq_peak_by_power_1': 0,
                                  'stat_freq_peak_by_power_2': 0,
                                  'stat_freq_peak_by_power_3': 0}
    top_peaks_ordered_by_order = {'stat_freq_peak_by_order_1': 0,
                                  'stat_freq_peak_by_order_2': 0,
                                  'stat_freq_peak_by_order_3': 0}

    # Median-filter the spectrum before peak picking to suppress noise
    amp_smooth = signal.medfilt(amp, kernel_size=15)
    peaks, height_d = signal.find_peaks(amp_smooth, distance=100, height=0.002)
    if peaks.size != 0:
        peak_f = freq[peaks]
        # First three peaks in order of appearance (ascending frequency)
        for peak, peak_name in zip(peak_f, top_peaks_ordered_by_order.keys()):
            top_peaks_ordered_by_order[peak_name] = peak
        # Three strongest peaks, ordered by height
        idx_three_top_peaks = height_d['peak_heights'].argsort()[-3:][::-1]
        top_3_freq = peak_f[idx_three_top_peaks]
        for peak, peak_name in zip(top_3_freq, top_peaks_ordered_by_power.keys()):
            top_peaks_ordered_by_power[peak_name] = peak

    specprops = {
        'stat_mean': mean,
        'stat_sd': sd,
        'stat_median': median,
        'stat_mode': mode,
        'stat_Q25': Q25,
        'stat_Q75': Q75,
        'stat_IQR': IQR,
        'stat_skew': skew,
        'stat_kurt': kurt,
    }
    specprops.update(top_peaks_ordered_by_power)
    specprops.update(top_peaks_ordered_by_order)
    return specprops
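For completeness, a hypothetical usage example with your loading code; file_path is assumed to point at one of your 3-second clips, and lowcut=50 is just an example threshold:

import librosa

wav_clip, sample_rate = librosa.load(file_path, duration=3, mono=True, sr=None)
features = spectral_statistics(wav_clip, int(sample_rate), lowcut=50)
print(features['stat_mode'], features['stat_IQR'])

The returned dict can be merged with your MFCC-based features before training.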
Upvotes: 2