Reputation: 1394
I'm trying to build a C# app that detects when music is present in a video. I can get at the audio in whatever format is required. However, I have hit a brick wall with music detection.
There are loads of posts about audio fingerprinting and how to do it in C# or any other language. However, I only want rough in/out times for where music occurs in a film; I'm not concerned with what the music is.
The music is unlikely to exist in any fingerprint database, so this would likely have to be an entirely computational analysis.
Are there any clever ideas? Or am I best off implementing a beat detection algorithm, processing the audio piece by piece, and then estimating the in/out points?
Upvotes: 5
Views: 1870
Reputation: 949
The OP's problem can be summarized as follows:
In the generalized audio stream of a video, try to detect "music" versus "everything else".
Where "music" is not likely to exist in fingerprint databases.
And where "everything else" in this context must include:
We must also assume that the audio soundtrack of a generalized video is highly processed with echo, reverb, multichannel panning, etc.
In the general video case, all of the above audio elements would be mixed together into the final audio, making the problem domain absolutely immense.
This is a very challenging problem, with most likely no simple or robust solution.
In support of this premise, consider a general music classifier (let's call it MuCLAS) in which the unknown music sample is a member of the classifier's training set. Even that is a very difficult problem, because of the significant expense involved in creating the training set and in creating and tuning the classifier index.
But the OP's problem domain is much larger than the MuCLAS problem domain, because the OP's unknown data set has much higher entropy. This implies much higher complexity and cost relative to MuCLAS.
Another supporting argument for the above premise is that state-of-the-art general speech recognition assumes, and insists upon, much lower entropy in the unknown data set than the entropy implied by the OP's data set.
And speech recognition is one of the best-funded problems in the general field of autonomous pattern recognition.
Upvotes: 0
Reputation: 56725
There are only two things that I can think of that clearly distinguish "Music" from all other Audio/sounds:
Meter: Virtually all composed music has a meter. In theory this should be detectable with an FFT, but over a frequency range of approximately 0.25 Hz to 10 Hz (instead of the usual 20 Hz to 20 kHz). In practice? I don't know, but it seems worth a try.
Tuning: Something common to almost all professional music, including the voices of professional singers (when they are musically accompanied), but not to any other sounds, is that the notes will all be in the same "tuning" of a 12-tone equal-tempered scale. In other words, their frequencies will always be separated by ratios that are integer powers of 2^(1/12). Once the tuning is established, they will never fall in the gaps between these steps. Normal sounds, including human voices, are spread all over the spectrum, but music is almost always within ±10 cents of a scale note.
Method #1 is iffy; I don't know if anyone has ever tried it.
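For anyone who wants to try the meter idea, here is a minimal sketch of one possible approach: reduce the audio to a slow loudness envelope, then look for the strongest periodicity in the 0.25 Hz to 10 Hz range with a direct DFT. It is written in plain Python for brevity and should port to C# easily. The function names `envelope` and `dominant_pulse_hz`, the 50 frames-per-second envelope rate, and the 0.05 Hz search step are all my own assumptions, and whether a loudness periodicity really separates music from other sounds is untested.

```python
import math

def envelope(samples, sample_rate, frame_rate=50):
    """RMS level per frame: a slowly varying loudness curve (assumed feature)."""
    hop = int(sample_rate // frame_rate)
    frames = []
    for start in range(0, len(samples) - hop + 1, hop):
        frame = samples[start:start + hop]
        frames.append(math.sqrt(sum(x * x for x in frame) / hop))
    return frames

def dominant_pulse_hz(env, frame_rate, lo_hz=0.25, hi_hz=10.0, step_hz=0.05):
    """Return (frequency, power) of the strongest envelope periodicity
    between lo_hz and hi_hz, found by evaluating the DFT directly."""
    mean = sum(env) / len(env)
    centered = [v - mean for v in env]  # remove the constant (0 Hz) part
    best_hz, best_power = 0.0, 0.0
    steps = int(round((hi_hz - lo_hz) / step_hz))
    for k in range(steps + 1):
        f = lo_hz + k * step_hz
        w = 2.0 * math.pi * f / frame_rate
        re = sum(v * math.cos(w * n) for n, v in enumerate(centered))
        im = sum(v * math.sin(w * n) for n, v in enumerate(centered))
        power = (re * re + im * im) / len(centered)
        if power > best_power:
            best_hz, best_power = f, power
    return best_hz, best_power

# Synthetic check: a 10-second envelope pulsing twice per second (120 BPM).
test_env = [1.0 + math.cos(2.0 * math.pi * 2.0 * n / 50.0) for n in range(500)]
pulse_hz, pulse_power = dominant_pulse_hz(test_env, frame_rate=50)
```

On real audio you would run this over sliding windows of several seconds and compare the peak power against the rest of the band; a single clear peak is only a hint of a steady beat, not proof of music.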
But #2 is definite: you can actually see this with an audio spectrum analyzer, though the FFT has to have very high resolution (at least 36 divisions per octave). There are some catches, however, such as:
Well, those are my "clever" ideas. Now it's just a small matter of implementation ... ;-)
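As a starting point for the tuning idea (#2), here is a minimal sketch of the core test: how far, in cents, a spectral peak lies from the nearest note of a 12-tone equal-tempered scale, and what fraction of the peaks in a window fall within ±10 cents. It is plain Python for brevity. The names `cents_from_nearest_note` and `in_tune_fraction` are hypothetical, and the A4 = 440 Hz reference is an assumption; in practice the reference would have to be estimated from the audio, since the recording's tuning is not known in advance. Finding the spectral peaks themselves is not shown.

```python
import math

A4_HZ = 440.0  # assumed reference pitch; real recordings may be tuned differently

def cents_from_nearest_note(freq_hz, ref_hz=A4_HZ):
    """Signed distance in cents from freq_hz to the nearest 12-TET note."""
    semitones = 12.0 * math.log2(freq_hz / ref_hz)
    return 100.0 * (semitones - round(semitones))

def in_tune_fraction(peak_freqs_hz, tolerance_cents=10.0, ref_hz=A4_HZ):
    """Fraction of spectral peaks lying within tolerance_cents of a scale note."""
    if not peak_freqs_hz:
        return 0.0
    hits = sum(
        1 for f in peak_freqs_hz
        if abs(cents_from_nearest_note(f, ref_hz)) <= tolerance_cents
    )
    return hits / len(peak_freqs_hz)

# Two peaks on scale notes (A4, middle C) and two in the gaps between notes.
example_peaks = [440.0, 261.63, 445.0, 452.0]
fraction = in_tune_fraction(example_peaks)
```

A window whose peaks score consistently high would be a candidate for "music"; the threshold for "high" is something you would have to tune on real material.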
Upvotes: 4
Reputation: 3584
You can use Microsoft Expression Encoder to work with video and audio.
Upvotes: 0