Reputation: 927
I am trying to learn deep learning, specifically convolutional neural networks, and I'd like to apply a simple network to some audio data. As far as I understand, CNNs are often used for image and object recognition, so when working with audio people often use the spectrogram (specifically the mel-spectrogram) instead of the time-domain signal. My question is: is it better to use an image of the spectrogram (i.e. RGB or greyscale values) as the input to the network, or should I use the 2D magnitude values of the spectrogram directly? Does it even make a difference?
Thank you.
Upvotes: 1
Views: 1274
Reputation: 4201
Images normally have local patterns, so sliding a convolution window over them is a natural way to extract local connectivity features. That means there is no problem with using images of the spectrum, in either the time or the frequency domain. The more interesting question is what happens if we use the spectrum data directly. I've seen a presentation where a CNN was applied to next-word prediction given the context. There the inputs were word vectors, which are just numbers, and CNN layers with rectangular filters were used to extract features. So if the data has some kind of natural pattern in how it is generated, feeding it in directly is perfectly fine.
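To make the point about filters working on plain numbers concrete, here is a minimal pure-Python sketch. The `conv2d_valid` helper, the tiny `spec` array, and the 3x2 `kernel` are all hypothetical, made up for illustration; a real network would learn its filters and use a framework's convolution layer instead.

```python
def conv2d_valid(x, k):
    """Slide kernel k over 2D input x (no padding, stride 1).

    This is the cross-correlation that CNN "convolution" layers compute.
    """
    kh, kw = len(k), len(k[0])
    out_h = len(x) - kh + 1
    out_w = len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4x5 "spectrogram": rows are frequency bins, columns are time frames.
# These are raw magnitudes, not pixel values.
spec = [
    [0.0, 0.1, 0.9, 0.1, 0.0],
    [0.0, 0.2, 1.0, 0.2, 0.0],
    [0.0, 0.1, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.0, 0.0],
]

# Hypothetical rectangular 3x2 filter: responds when energy rises
# from one time frame to the next across three adjacent frequency bins.
kernel = [
    [-1.0, 1.0],
    [-1.0, 1.0],
    [-1.0, 1.0],
]

features = conv2d_valid(spec, kernel)
for row in features:
    print([round(v, 2) for v in row])
```

The filter never needs to know whether its input came from an image file or from a spectrogram array; it only sees a 2D grid of numbers with local structure.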
Upvotes: 0
Reputation: 77827
The spectrogram is a lovely representation, especially for describing the process. Functionally, it's merely a simplification of the input data that adds no information, and loses a smidgen of accuracy -- which probably doesn't matter. The preprocessing doesn't buy you anything, so just use the 2d data and let the CNN take things from there.
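One way to read the "smidgen of accuracy" remark, assuming the image is an 8-bit greyscale rendering of the magnitudes: each value gets rounded to one of 256 levels, so a small rounding error is introduced and nothing is gained. The sketch below uses hypothetical helper names (`to_uint8_image`, `from_uint8_image`) and made-up magnitude values to show the size of that error.

```python
def to_uint8_image(mag):
    """Scale a 2D list of magnitudes to integer grey levels 0..255."""
    lo = min(min(row) for row in mag)
    hi = max(max(row) for row in mag)
    scale = 255.0 / (hi - lo)
    img = [[round((v - lo) * scale) for v in row] for row in mag]
    return img, lo, hi


def from_uint8_image(img, lo, hi):
    """Map grey levels back to the original magnitude range."""
    step = (hi - lo) / 255.0
    return [[lo + p * step for p in row] for row in img]


# Hypothetical spectrogram magnitudes (floats), not taken from real audio.
mag = [
    [0.0, 0.1234, 0.5],
    [0.9876, 0.3333, 1.0],
]

img, lo, hi = to_uint8_image(mag)
recovered = from_uint8_image(img, lo, hi)

# Largest difference between the original values and the image round trip.
max_err = max(
    abs(a - b)
    for row_a, row_b in zip(mag, recovered)
    for a, b in zip(row_a, row_b)
)
print(img)
print(max_err)
```

The round-trip error is bounded by half a grey level, which is usually negligible, but the raw 2D array avoids it entirely and skips the rendering step.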
Upvotes: 1