Reputation: 1036
I am currently developing an audio classifier with the Python API of TensorFlow, using the UrbanSound8K dataset and trying to distinguish between 10 mutually exclusive classes.
The audio files are 4 seconds long and contain 176400 data points each, which leads to serious memory issues. How should the audio be pre-processed to reduce memory usage?
And how can more useful features be extracted from the audio (using convolution and pooling)?
Upvotes: 2
Views: 2309
Reputation: 36
I personally prefer spectrograms as input for neural nets when it comes to sound classification. This way, raw audio data is transformed into an image representation and you can treat it like a basic image classification task.
There are a number of ways to do this; here is what I usually do, using scipy, python_speech_features and pydub:
import numpy as np
import scipy.io.wavfile as wave
import python_speech_features as psf
from pydub import AudioSegment
#your sound file
filepath = 'my-sound.wav'
def convert(path):
    #open the file (supports all ffmpeg-supported filetypes)
    audio = AudioSegment.from_file(path, path.split('.')[-1].lower())
    #set to mono
    audio = audio.set_channels(1)
    #set to 44.1 kHz
    audio = audio.set_frame_rate(44100)
    #save as wav (overwrites the file at the original path)
    audio.export(path, format="wav")
def getSpectrogram(path, winlen=0.025, winstep=0.01, NFFT=512):
    #open wav file
    (rate, sig) = wave.read(path)
    #split the signal into overlapping frames
    winfunc = lambda x: np.ones((x,))
    frames = psf.sigproc.framesig(sig, winlen*rate, winstep*rate, winfunc)
    #magnitude spectrogram, rotated 90 degrees so frequency runs along the first axis
    magspec = np.rot90(psf.sigproc.magspec(frames, NFFT))
    #noise reduction (mean subtraction)
    magspec -= magspec.mean(axis=0)
    #normalize values between 0 and 1
    magspec -= magspec.min(axis=0)
    magspec /= magspec.max(axis=0)
    #show spectrogram dimensions
    print(magspec.shape)
    return magspec
#convert file if you need to
convert(filepath)
#get spectrogram
spec = getSpectrogram(filepath)
First, you need to standardize your audio files in terms of sample rate and channels. You can do that (and more) with the excellent pydub package.
After that, you need to transform the audio signal into an image with the FFT. You can do that with scipy.io.wavfile and the sigproc module of python_speech_features. I like to take the magnitude spectrogram, rotate it 90 degrees, normalize it and use the resulting NumPy array as input for my convnets. You can change the spatial dimensions of the spectrogram by adjusting the values of winstep and NFFT to fit your input size.
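For the default parameters above, the resulting dimensions are easy to estimate: magspec returns NFFT/2 + 1 frequency bins, and the number of frames follows from the window step. A rough back-of-the-envelope helper (names and defaults are my own, not part of python_speech_features):

import math

#estimate the spectrogram shape for a given clip length and STFT settings
def estimate_spec_shape(duration_s=4.0, rate=44100, winlen=0.025, winstep=0.01, NFFT=512):
    num_samples = int(duration_s * rate)
    frame_len = int(round(winlen * rate))
    frame_step = int(round(winstep * rate))
    #framesig zero-pads the last frame, so the frame count is rounded up
    num_frames = 1 + int(math.ceil((num_samples - frame_len) / float(frame_step)))
    #magspec keeps only the positive frequencies
    num_bins = NFFT // 2 + 1
    #after np.rot90 the array is (frequency, time)
    return (num_bins, num_frames)

print(estimate_spec_shape())   #roughly (257, 399) for a 4 second clip at 44.1 kHz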
There might be easier ways to do all that; I achieved good overall classification results using the code above.
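As for the convolution/pooling part of your question: once every clip is reduced to a fixed-size spectrogram, you can feed it into a small convnet in TensorFlow. Below is only a minimal sketch using tf.keras with placeholder layer sizes; it assumes the roughly (257, 399) spectrogram shape from above plus a channel axis, and is a starting point rather than a tuned architecture:

import tensorflow as tf

#minimal convolution/pooling model over spectrogram "images"
#input shape (257, 399, 1) assumes the default getSpectrogram settings above
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(257, 399, 1)),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')   #10 UrbanSound8K classes
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
#train with model.fit(...), passing your stacked spectrograms (with a trailing
#channel dimension) and the integer class labels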
Upvotes: 2