Streetlamp
Streetlamp

Reputation: 1556

Detecting duplicate audio files

I have snippets of audio that are almost the same that I want to group together (samples 5 and 3 below). There are other portions that are similar, but differ (3 and 4, there is a double drum hit at the end for 3) and completely different ones (sample 8).

screenshot of waveforms of audio samples

How can I group together samples that are almost the same? I tried taking the difference (attempting to minimize it), but that does not work since they are not aligned. I also tried to take audio features like pitch distribution, but since the sounds are similar in pitch those don't get separated well.

The files are available here: https://drive.google.com/drive/folders/14UQQDfIBUNRO_1Pv8bkPf9noi86M7lKd

Upvotes: 0

Views: 294

Answers (1)

skumhest
skumhest

Reputation: 391

Here's something that appears to work for the data you are using but may (likely does) have weaknesses when it comes to other data or other sorts of data. But maybe it will be helpful nonetheless.

The basic idea of this solution is to compute the MFCCs of each of the samples to get feature vectors and then find a distance (here just using basic Euclidean distance) between those feature sets with the assumption (which seems to be true for your data) that the least similar samples will have a large distance and the closest will have the least. Here's the code:

import librosa
import scipy
import matplotlib.pyplot as plt

sample3, rate = librosa.load('sample3.wav', sr=None)
sample4, rate = librosa.load('sample4.wav', sr=None)
sample5, rate = librosa.load('sample5.wav', sr=None)
sample8, rate = librosa.load('sample8.wav', sr=None)

# cut the longer sounds to same length as the shortest
len5 = len(sample5)
sample3 = sample3[:len5]
sample4 = sample4[:len5]
sample8 = sample8[:len5]

mf3 = librosa.feature.mfcc(sample3, sr=rate)
mf4 = librosa.feature.mfcc(sample4, sr=rate)
mf5 = librosa.feature.mfcc(sample5, sr=rate)
mf8 = librosa.feature.mfcc(sample8, sr=rate)

# average across the frames. dubious?
amf3 = mf3.mean(axis=0)
amf4 = mf4.mean(axis=0)
amf5 = mf5.mean(axis=0)
amf8 = mf8.mean(axis=0)

f_list = [amf3, amf4, amf5, amf8]

results = []
for i, features_a in enumerate(f_list):
    results.append([])
    for features_b in f_list:
        result = scipy.spatial.distance.euclidean(features_a,
                                                  features_b)
        results[i].append(result)

plt.ion()
fig, ax = plt.subplots()

ax.imshow(results, cmap='gray_r', interpolation='nearest')

spots = [0, 1, 2, 3]
labels = ['s3', 's4', 's5', 's8']
ax.set_xticks(spots)
ax.set_xticklabels(labels)
ax.set_yticks(spots)
ax.set_yticklabels(labels)

The code plots a heatmap of the distances across all the samples. The code is lazy so it both re-computes the elements that are symmetric across the diagonal, which are the same, and the diagonal itself (which should be zero distance) but those are sort of sanity checks as it is nice to see white down the diagonal and that the matrix is symmetric.

The real information is that clip 8 is black against all the other clips (i.e. furthest from them) and clip 3 and clip 5 are the least distant from one another.

This basic idea could be done with a feature vector generated in a different sort of way (e.g. instead of MFCCs, you could use the embeddings from something like YAMNet) or with a different way of finding a distance between the feature vectors.

For the grouping part of what you want to do, you could experimentally work out a threshold on the distance metric below which you would consider a clip to be in the same group as another. With more clips, you could compute all these distances and then hand that distance matrix over to a clustering algorithm (like HDBSCAN) to cluster the clips.

Upvotes: 1

Related Questions