How to process input data for audio classification using CNN with PyTorch?

Question

As an engineering student working toward the DSP and ML fields, I am working on an audio classification project whose inputs are short clips (4 sec.) of instruments like bass, keyboard, guitar, etc. (the NSynth Dataset by the Magenta team at Google).

The idea is to convert all the short clips (.wav files) to spectrograms or mel spectrograms and then train a CNN on them.

However, my question is this: since the entire dataset is large (approximately 23 GB), should I first convert all the audio files to images such as PNG and then apply the CNN? I feel this could take a lot of time, and it would double the storage space for my input data, since I would now have audio + images (maybe up to 70 GB).

Thus, I wonder if there is any workaround that can speed up the process.

Thanks in advance.

Jindřich · Accepted Answer

Preprocessing is totally worth it. You will very likely end up running multiple experiments before your network works the way you want it to, and you don't want to waste time recomputing the features every time you change a few hyperparameters.

Rather than using PNG, I would save PyTorch tensors directly (torch.save, which uses Python's standard pickling protocol) or NumPy arrays (numpy.savez stores several arrays in a single uncompressed .npz archive). If you are concerned about disk space, consider numpy.savez_compressed.
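As a rough illustration of that workflow, here is a minimal sketch that computes a feature array once, caches it with numpy.savez_compressed, and loads it back. It is an assumption-laden example rather than a recommended pipeline: the synthetic sine wave stands in for a real .wav clip, the 16 kHz sample rate and the STFT settings are assumed values, the plain log-magnitude spectrogram stands in for a mel spectrogram (which you would more likely compute with an audio library), and the helper names log_spectrogram, cache_features and load_features are hypothetical.

```python
import os
import tempfile

import numpy as np


def log_spectrogram(signal, n_fft=1024, hop=256):
    """Log-magnitude STFT spectrogram with shape (n_fft // 2 + 1, n_frames).

    Simplified stand-in for a mel spectrogram; n_fft and hop are assumed values.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)]
    )
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    # float32 halves the size on disk compared with the default float64.
    return np.log1p(magnitude).astype(np.float32)


def cache_features(signal, label, path):
    """Compute the features once and store them with their label (hypothetical helper)."""
    np.savez_compressed(path, features=log_spectrogram(signal), label=label)


def load_features(path):
    """Read a cached example back as (features, label) (hypothetical helper)."""
    with np.load(path) as data:
        return data["features"], int(data["label"])


# Stand-in for one 4-second clip: a 440 Hz sine at an assumed 16 kHz sample rate.
sample_rate = 16000
t = np.arange(4 * sample_rate) / sample_rate
clip = np.sin(2 * np.pi * 440.0 * t).astype(np.float32)

cache_path = os.path.join(tempfile.mkdtemp(), "clip_0000.npz")
cache_features(clip, label=3, path=cache_path)
features, label = load_features(cache_path)
print(features.shape, features.dtype, label)
```

During training, a PyTorch Dataset could call something like load_features in its __getitem__ and wrap the array with torch.from_numpy, so each experiment reads the cached arrays instead of recomputing spectrograms.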
