Reputation: 1689
I found some papers and slides on using deep learning for audio classification.
Some of them use spectrograms as the input to the deep learning model.
I want to know how this is implemented in practice.
And I found this slide.
Page 67
From my understanding, the first layer has 24 nodes and the input is a spectrogram split into 24 different time periods.
For example, if an audio event is 2.4 seconds long, the first node receives the spectrogram of 0~0.1 second, the second node the spectrogram of 0.1~0.2 second, and so on.
Did I misunderstand?
My question: if there is a 3.0-second audio event, how do I classify it?
Upvotes: 4
Views: 1752
Reputation: 2553
I trained a CNN to detect the language that was spoken in an audio recording. It currently supports 176 languages with an accuracy of 98.8%. I have a fairly well commented Jupyter Notebook on my GitHub account: Spoken Language Classifier.
I expect that this is what you're looking for. Some of the things I learned include:
The architecture doesn't need to be recurrent, because time can be encoded along the x axis. For a non-recurrent CNN, though, the input length has to be fixed.
Spectrograms are semantically different from photographs in many respects. Popular architectures that work well for photos may be complete overkill for spectrograms.
Experiment with the x and y resolutions individually. My initial assumption that the time axis needs higher resolution than the frequency axis turned out to be wrong in my use case.
Use a mel spectrogram to give higher resolution to lower frequencies. Human hearing resolves frequency logarithmically, not linearly.
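This is not code from my notebook, but a minimal numpy-only sketch of the last point: a magnitude STFT followed by a triangular mel filterbank (the sample rate, FFT size, hop, and number of mel bands are all illustrative choices).

```python
import numpy as np

def hz_to_mel(f):
    # Mel scale: roughly linear below 1 kHz, logarithmic above.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Power STFT followed by a triangular mel filterbank."""
    # Frame the signal and apply a Hann window.
    window = np.hanning(n_fft)
    frames = [signal[s:s + n_fft] * window
              for s in range(0, len(signal) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2  # (time, freq)

    # Triangular filters spaced evenly on the mel scale:
    # dense at low frequencies, sparse at high ones.
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return spec @ fb.T  # (time, n_mels)

# One second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
mel = mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(mel.shape)  # (61, 40)
```

In practice you'd use a library routine for this (librosa has one), but the sketch shows why the mel axis spends most of its bands on low frequencies.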
Upvotes: 0
Reputation: 166
I managed to classify time-series data using a Convolutional Neural Network. A CNN is basically an Artificial Neural Network whose input is first convolved to extract specific features. Intuitively, the convolution operation highlights specific features of the data, like a flashlight shone over different parts of an image: by sliding it across, we highlight specific features of the image.
That's the main idea of a CNN: it is inherently designed to extract spatial features. Convolutions are usually stacked, so the output is three-dimensional (rows, columns, channels). The downside of this process is the large computation time. To reduce it, we apply pooling (downsampling), which shrinks the feature maps without losing essential features/information. For example, before pooling you might have 12 feature maps of size 6×6; after pooling you have 12 maps of size 3×3. You can repeat these two steps several times before flattening, which squashes everything into an (n, 1) array. After that, the normal ANN steps apply.
In short, time-series data can be classified with a CNN. Here are the steps:
1. Convolution
2. Pooling
3. Flattening
4. Full connection (normal ANN steps)
You can add as many convolution and pooling layers as you like, but watch out for training time. There's a video on this by my favourite youtuber, Siraj Raval. By the way, I suggest you use Keras for deep learning. Hands down the easiest deep learning library to use. Hope it helps.
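The four steps above can be sketched end to end in plain numpy (a toy forward pass, not trained code; all shapes and the random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in most DL libraries)."""
    h, w = kernel.shape
    out = np.zeros((image.shape[0] - h + 1, image.shape[1] - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling; halves each spatial dimension for size=2."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

# 1. Convolution: a 6x6 "spectrogram" convolved with a 3x3 kernel -> 4x4 map.
spectrogram = rng.standard_normal((6, 6))
feature_map = np.maximum(conv2d(spectrogram, rng.standard_normal((3, 3))), 0)  # ReLU

# 2. Pooling: 4x4 -> 2x2.
pooled = max_pool(feature_map)

# 3. Flattening: 2x2 -> (4,).
flat = pooled.ravel()

# 4. Full connection: one dense layer with softmax over 3 classes.
logits = flat @ rng.standard_normal((4, 3))
probs = np.exp(logits) / np.exp(logits).sum()
print(probs.shape)  # (3,)
```

In Keras each step maps to a layer (Conv2D, MaxPooling2D, Flatten, Dense), so the whole pipeline is a few lines.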
Upvotes: 3
Reputation: 1452
You should use Kaldi. CTC takes care of variable temporal resolution, so the input length doesn't have to be fixed.
Upvotes: 1