Simon Kemper

Reputation: 645

How to compare audio on similarity in Python?

I am using the Python-based audio library librosa to analyze musical audio tracks for note onset events. With this information I am slicing those tracks into several smaller, very short pieces / slices - all based on the note onset events.

Having those slices, I am analyzing them using librosa's built-in tools for feature extraction, such as the chromagram or MFCCs. The output looks like:

librosa.feature.chroma_stft(y=y, sr=sr)
array([[ 0.974,  0.881, ...,  0.925,  1.   ],
       [ 1.   ,  0.841, ...,  0.882,  0.878],
       ...,
       [ 0.658,  0.985, ...,  0.878,  0.764],
       [ 0.969,  0.92 , ...,  0.974,  0.915]])

librosa.feature.mfcc(y=y, sr=sr)
array([[ -5.229e+02,  -4.944e+02, ...,  -5.229e+02,  -5.229e+02],
       [  7.105e-15,   3.787e+01, ...,  -7.105e-15,  -7.105e-15],
       ...,
       [  1.066e-14,  -7.500e+00, ...,   1.421e-14,   1.421e-14],
       [  3.109e-14,  -5.058e+00, ...,   2.931e-14,   2.931e-14]])

As we can see, these functions output a matrix which holds the information about the extracted features. All this information (features, slice start and end, filename) will be stored in an (sqlite) database. The sliced audio data will then be released.

The features describe the "type" / sound of the analyzed audio numerically and are a good basis for similarity calculations.

Having all this information (and a large database with hundreds of analyzed tracks) I want to be able to pick a random slice and compare it against all the other slices in the database to find the one that is most similar to the picked one - based on the extracted feature information.

What do I need to do to compare the results of the above listed functions for similarity?

Upvotes: 8

Views: 16880

Answers (3)

Jon Nordby

Reputation: 6259

There are many possible definitions of audio similarity. For musical notes, one should think about whether changes in the following should be considered dissimilar or not:

  • Pitch shifts
  • Timbre differences
  • Amplitude shifts
  • Amplitude curves over time
  • Reverberation
  • Time shifts
  • Time stretching

If one computes similarity on MFCCs or the chromagram, it will cover almost all of the above attributes: any change will register as dissimilar. Such similarity can be computed by taking the Euclidean distance. Standardizing the data first helps smooth out feature-scaling differences. It is usually useful to at least make the measure insensitive to amplitude, which can be done by normalizing the audio level before computing the difference.
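As a minimal sketch of this approach, assuming each slice has already been reduced to a fixed-length feature vector (e.g. the per-coefficient mean of its MFCC matrix - the function names below are illustrative, not from librosa):

```python
import math

def standardize(vectors):
    """Z-score each dimension across the dataset (assumes nonzero variance)."""
    dims = len(vectors[0])
    n = len(vectors)
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / n)
            for d in range(dims)]
    return [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]

def euclidean(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

The most similar slice is then simply the one in the database with the smallest `euclidean` distance to the standardized query vector.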

Invariance to time shifts and stretches can be achieved by using Dynamic Time Warping instead of Euclidean distance.

If one wants to find out whether two audio clips are the same note, regardless of instrument/effects etc., then one should extract the fundamental frequency (F0) and use only that for similarity.
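Once F0 has been extracted (e.g. with a pitch tracker such as librosa's `pyin`), a natural similarity measure is the pitch distance in cents; a small hypothetical helper:

```python
import math

def f0_distance_cents(f_a, f_b):
    """Absolute pitch distance in cents between two F0 values in Hz.
    1200 cents = one octave; under ~50 cents is roughly 'the same note'."""
    return abs(1200.0 * math.log2(f_a / f_b))
```

Working in cents rather than raw Hz makes the comparison perceptually meaningful: the same musical interval gives the same distance in any register.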

On the other hand, if one wants notes from the same instrument, then one needs to implement timbre extraction and compute similarity on the resulting features.

It is also possible to learn a similarity function using supervised learning. This requires a representative dataset with annotations representing the desired similarity.

Upvotes: 0

Pianistprogrammer

Reputation: 637

Librosa has a `segment.cross_similarity` function you can use for this task; you only need to decide which features you want to cross-check.

Upvotes: 4

milahu

Reputation: 3529

The problem you describe is ranking.

You must find a "good formula"
to reduce "all the dimensions" into one dimension
--> similarity, proximity, closeness, rank.

general formula for a "weighted sum":

rank(o, x)  =  w_1*(x_1 - o_1)^e_1  +  w_2*(x_2 - o_2)^e_2  +  ...

with the origin (o_1, o_2, ...) = your needle, the one slice you pick,
the point (x_1, x_2, ...) = your haystack, all the other slices,
the weights (w_1, w_2, ...),
and the exponents (e_1, e_2, ...).

Weights and exponents are a simple way to fine-tune your formula.
If your dimensions were orthogonal, the exponent would simply be two --> Cartesian geometry.
But in real-world data analysis, dimensions are usually correlated (not orthogonal),
so you will need to guess parameters
(and group similar dimensions into more complex summands)
to get acceptable results.
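A minimal sketch of that weighted-sum formula in Python (using |x - o| so that odd exponents still behave like a distance; all names are illustrative):

```python
def rank(origin, point, weights, exponents):
    """Weighted sum of per-dimension differences; lower = more similar."""
    return sum(w * abs(x - o) ** e
               for o, x, w, e in zip(origin, point, weights, exponents))

def most_similar(needle, haystack, weights, exponents):
    """Return the haystack point with the smallest rank relative to the needle."""
    return min(haystack, key=lambda p: rank(needle, p, weights, exponents))
```

Tuning `weights` and `exponents` is exactly the "guess parameters" step: dimensions you care about more get larger weights.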

Another option is the sledgehammer of machine learning,
but then you must train your own model,
and you must also find a way to rank your files.

Upvotes: 1