The6thSense

Reputation: 8335

Compare two audio files of people speaking and compute a similarity score

Big picture: trying to identify proxy fraud in video interviews.

I have video clips of interviews, and each person has two or more interviews. As a first step I am extracting the audio from the interviews and trying to match the clips to determine whether the audio is from the same person.

I used the Python library librosa to parse the audio files and generate MFCC and chroma_cqt features for them. I then built a cross-similarity matrix for the two files. I want to convert this similarity matrix to a score between 0 and 100, where 100 is a perfect match and 0 is totally different. After that I can pick a threshold and assign labels to the audio files.

Code:

import librosa

hop_length = 1024

# Load both recordings (librosa resamples to 22050 Hz by default)
y_ref, sr1 = librosa.load(r"audio1.wav")
y_comp, sr2 = librosa.load(r"audio2.wav")

# Chroma features from the constant-Q transform
chroma_ref = librosa.feature.chroma_cqt(y=y_ref, sr=sr1, hop_length=hop_length)
chroma_comp = librosa.feature.chroma_cqt(y=y_comp, sr=sr2, hop_length=hop_length)

# MFCCs (recent librosa versions require keyword arguments here)
mfcc1 = librosa.feature.mfcc(y=y_ref, sr=sr1, n_mfcc=13)
mfcc2 = librosa.feature.mfcc(y=y_comp, sr=sr2, n_mfcc=13)

# Use time-delay embedding to get a cleaner recurrence matrix
x_ref = librosa.feature.stack_memory(chroma_ref, n_steps=10, delay=3)
x_comp = librosa.feature.stack_memory(chroma_comp, n_steps=10, delay=3)

sim = librosa.segment.cross_similarity(x_comp, x_ref, metric='cosine')
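
One rough way to collapse this matrix into a single 0-100 number (a sketch only, not a validated speaker-similarity measure; frame-level chroma similarity mostly reflects what is said rather than who says it) is to recompute the matrix in 'affinity' mode, where cells lie in [0, 1], and scale the mean:

import numpy as np

# 'affinity' mode fills cells with values in [0, 1] instead of a binary connectivity matrix
sim_aff = librosa.segment.cross_similarity(x_comp, x_ref, metric='cosine', mode='affinity')
score = float(np.mean(sim_aff)) * 100  # 100 = every frame pair similar, 0 = no similar frames
print(f"similarity score: {score:.1f}")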

Upvotes: 1

Views: 3585

Answers (2)

Anas sain

Reputation: 259

To compare two audio files of speakers and compute a similarity score, you can use the pyannote/embedding or pyannote/wespeaker-voxceleb-resnet34-LM models. I have used both. Here's a quick breakdown (a usage sketch follows the list):

  1. Pyannote/embedding:

This model generates speaker embeddings from audio but tends to be less accurate at distinguishing between different speakers. You compare two embeddings with a cosine metric: with cosine distance, the closer the score is to 0, the more likely the two clips are from the same speaker, while a larger distance suggests different speakers.

You can try out the model on Hugging Face: https://huggingface.co/pyannote/embedding

  2. Pyannote/wespeaker-voxceleb-resnet34-LM:

This one is a step up! Fine-tuned on the VoxCeleb dataset, it's much better at verifying speakers. Like the previous model, it computes embeddings and compares them with a cosine metric, but it generally yields more reliable results.

Url: https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM
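
For reference, a minimal sketch of how the model cards of both models suggest extracting embeddings and comparing them (audio1.wav, audio2.wav and HF_TOKEN are placeholders; both models are gated, so a Hugging Face access token is needed):

from pyannote.audio import Model, Inference
from scipy.spatial.distance import cdist
import numpy as np

# Works the same way with "pyannote/embedding"
model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM",
                              use_auth_token="HF_TOKEN")  # placeholder token
inference = Inference(model, window="whole")  # one embedding per whole file

emb1 = np.atleast_2d(inference("audio1.wav"))
emb2 = np.atleast_2d(inference("audio2.wav"))

distance = cdist(emb1, emb2, metric="cosine")[0, 0]
print(f"cosine distance: {distance:.3f} (lower suggests the same speaker)")

The threshold that separates "same speaker" from "different speaker" still has to be tuned on your own labelled pairs.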

Upvotes: 0

Jon Nordby

Reputation: 6299

The task of identifying who is talking is called Speaker Identification. Checking whether two audio clips have the same speaker is called Speaker Verification. If there are multiple speakers in a dialog, it may also be relevant to do Speaker Diarization, i.e. finding out who talks when. That would make it possible to focus on the interview subject rather than the interviewer.

Speaker recognition tasks like these are best solved with a deep neural network, as it is quite a difficult task to separate who is speaking from what is being said. The models generally output a speaker embedding - a vector representation in which recordings of the same person's speech end up close together. One can then apply a simple similarity metric to this representation, such as cosine distance.
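
As a small illustration of that last step (assuming two embedding vectors have already been extracted by whatever model), cosine distance is simply:

import numpy as np

def cosine_distance(a, b):
    # 0 = same direction (very similar voices); values around 1 or above = dissimilar
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))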

There are pretrained models available for this, for example in pyannote-audio and in SpeechBrain.
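
For example, a minimal sketch with SpeechBrain's pretrained ECAPA-TDNN verification model (the import path is for recent SpeechBrain releases; older ones expose it under speechbrain.pretrained):

from speechbrain.inference.speaker import SpeakerRecognition

verification = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)
# verify_files returns a cosine-similarity score and a same-speaker decision at the default threshold
score, prediction = verification.verify_files("audio1.wav", "audio2.wav")
print(score, prediction)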

Upvotes: 2
