nevos

Reputation: 927

Input data for convolutional neural network

I am trying to learn deep learning, specifically convolutional neural networks, and I'd like to apply a simple network to some audio data. As far as I understand, CNNs are often used for image and object recognition, so when working with audio people often use the spectrogram (specifically the mel-spectrogram) instead of the time-domain signal. My question is: is it better to use an image of the spectrogram (i.e. RGB or greyscale values) as the input to the network, or should I use the 2D magnitude values of the spectrogram directly? Does it even make a difference?

Thank you.

Upvotes: 1

Views: 1274

Answers (2)

Shamane Siriwardhana

Reputation: 4201

Images normally have local patterns, so applying a convolution window is a natural way to extract local connectivity features. There is therefore no problem with using images of the spectrum in the time or frequency domain. The more interesting question is: what if we use the spectrum data directly? I've seen a presentation where a CNN was applied to next-word prediction given the context. There the inputs were word vectors, in other words plain numbers, and CNN layers (with rectangular filters) were used to extract features. So if the data has some kind of natural local pattern in how it is generated, feeding it in directly is perfectly fine.
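To illustrate the point that a convolution window works on plain numbers just as well as on pixels, here is a minimal sketch in NumPy. The helper `conv2d_valid`, the toy input `x`, and the 1x2 `kernel` are all made up for illustration; like most CNN layers, it computes a sliding dot product (cross-correlation) without flipping the kernel.

```python
import numpy as np

def conv2d_valid(x, kernel):
    """Slide `kernel` over `x` (no padding) and sum the elementwise products."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical numeric input (not an image): values step from 0 to 1
# between columns 1 and 2.
x = np.zeros((6, 5))
x[:, 2:] = 1.0

# A rectangular 1x2 filter that responds to left-to-right increases.
kernel = np.array([[-1.0, 1.0]])

features = conv2d_valid(x, kernel)
print(features.shape)
print(features)
```

The filter responds only where the local pattern (the step) occurs, which is the kind of structure the network can pick up regardless of whether the array came from an image file or from raw magnitudes.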

Upvotes: 0

Prune

Reputation: 77827

The spectrogram is a lovely representation, especially for describing the process. Functionally, it's merely a simplification of the input data that adds no information, and loses a smidgen of accuracy -- which probably doesn't matter. The preprocessing doesn't buy you anything, so just use the 2d data and let the CNN take things from there.
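As a rough sketch of the two options, assuming a plain STFT magnitude spectrogram rather than a mel-spectrogram, and a synthetic 440 Hz tone as stand-in audio: the helper `magnitude_spectrogram`, the frame sizes, and the normalisation are illustrative choices, not a recommended pipeline. Option A keeps the float values as a single input channel; option B mimics saving an 8-bit greyscale image and measures the rounding error that introduces.

```python
import numpy as np

def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Short-time Fourier transform magnitudes, shape (freq_bins, frames)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T

# Hypothetical test signal: one second of a 440 Hz tone sampled at 8 kHz.
sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(signal)
log_spec = np.log1p(spec)  # compress the dynamic range

# Option A: feed the float values directly as one input channel.
x_direct = (log_spec - log_spec.mean()) / (log_spec.std() + 1e-8)
x_direct = x_direct[np.newaxis, :, :].astype(np.float32)  # (1, freq, time)

# Option B: quantise to 8-bit greyscale, as writing an image file would.
lo, hi = log_spec.min(), log_spec.max()
grey = np.round(255 * (log_spec - lo) / (hi - lo)).astype(np.uint8)
restored = grey.astype(np.float64) / 255 * (hi - lo) + lo
max_quantization_error = np.abs(restored - log_spec).max()

print(x_direct.shape)
print(max_quantization_error)
```

The greyscale round trip can be off by up to half a quantisation step, (hi - lo) / 510, which is the small loss of accuracy mentioned above; an RGB colour map would additionally spread one value over three channels without adding information.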

Upvotes: 1
