Reputation: 6259
I am working on a project involving machine learning and data comparison.
For the purpose of this project, I am feeding abstracted video data to a neural network.
Now, abstracting image data is quite simple. I can take still-frames at certain points in the video, scale them down to 5 by 5 pixels (or any other manageable resolution) and get the pixel values for analysis.
The resulting data gives a unique, small and somewhat data-rich sample (even 5 samples of 5x5 px are enough to distinguish a drama from a nature documentary, etc.).
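For reference, here is a minimal sketch of that frame-abstraction step, assuming OpenCV is available (the file path and timestamps are placeholders):

```python
import cv2
import numpy as np

def abstract_frames(video_path, times_s, size=(5, 5)):
    """Sample frames at the given timestamps and flatten each
    to a small grayscale pixel vector."""
    cap = cv2.VideoCapture(video_path)
    vectors = []
    for t in times_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000)  # seek to timestamp
        ok, frame = cap.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        small = cv2.resize(gray, size, interpolation=cv2.INTER_AREA)
        vectors.append(small.flatten() / 255.0)  # normalise to [0, 1]
    cap.release()
    return np.array(vectors)  # shape: (n_frames, 25) for 5x5
```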
However, I am stuck on the audio part. Since audio consists of samples and each sample by itself has no inherent meaning, I can't find a way to abstract audio down into processable blocks.
Are there common techniques for this process? If not, what metrics could be used to quantify and abstract audio data?
Upvotes: 0
Views: 170
Reputation: 9159
The process you require is audio feature extraction. A large number of feature detection algorithms exist, usually specialising in either music or speech signals. For music, chroma, rhythm and harmonic distribution are all features you might extract, along with many more. Typically, audio feature extraction algorithms work at a fairly macro level - that is to say, thousands of samples at a time.
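As a rough illustration, here is a minimal sketch of macro-level feature extraction using MFCCs via the librosa library (the library, frame sizes and summary statistics here are my own assumptions, not the only sensible choices):

```python
import librosa
import numpy as np

def abstract_audio(audio_path, n_mfcc=13):
    """Split the signal into short frames (~2048 samples each) and
    summarise the whole clip by the mean and spread of its MFCCs."""
    y, sr = librosa.load(audio_path, sr=22050, mono=True)
    # Each MFCC column describes one ~93 ms window of audio,
    # i.e. thousands of raw samples collapsed into a few coefficients.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=2048, hop_length=512)
    # Collapse the time axis into one fixed-length vector per clip,
    # which can then be fed to a network alongside the video vectors.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```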
A good place to get started is Sonic Visualiser, which is a host for audio analysis plug-ins - many of which are feature extractors.
YAAFE (Yet Another Audio Feature Extractor) may also have some useful stuff in it.
Upvotes: 1