Neil

Reputation: 81

Finding similarity between two audio signals spoken by two different people

I have one master.wav file of 14 seconds and another child.wav file of 221 seconds, divided into 207 chunks of 14 seconds each. Now I want to compare each child chunk with the master file and find the similarity between them. The hypothesis is that the child chunk with the highest similarity contains exactly, or roughly, the same words as those spoken in the master file. I am using the pyAudioAnalysis library to extract features from the .wav files (https://github.com/tyiannak/pyAudioAnalysis).
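For reference, a minimal sketch of the chunking step (assuming child.wav is a mono PCM file readable by scipy; the file names are the ones above and the chunk length matches the master file):

    from scipy.io import wavfile

    CHUNK_SECONDS = 14

    # Read the long recording: scipy returns the sample rate and a numpy array
    sr, child = wavfile.read("child.wav")

    # Split into non-overlapping 14-second chunks, dropping any trailing remainder
    samples_per_chunk = CHUNK_SECONDS * sr
    n_chunks = len(child) // samples_per_chunk
    chunks = [child[i * samples_per_chunk:(i + 1) * samples_per_chunk]
              for i in range(n_chunks)]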

Upvotes: 1

Views: 2920

Answers (3)

whkang

Reputation: 360

You can extract an embedding vector from each chunk and compute their cosine similarity (or another distance metric if you prefer). An embedding vector is a fixed-dimensional vector, which lets you compare speech samples of different durations, summarizing the global information (e.g., speaker identity) in the given speech. Embedding vectors can be extracted with encoder modules trained for distributional representation or speaker recognition. Here are some popular embedding methods:

  1. i-vector: trained to summarize the distributive pattern of the given speech
  2. deep learning-based embeddings (e.g., x-vector): trained to contain speaker discriminative information

These methods are useful for comparing speech samples that utter different sentences (or words), as they are usually optimized for text-independent speaker verification. If you don't have large-scale training data available to train such models yourself, there are fortunately some pre-trained models publicly available:

  1. https://github.com/clovaai/voxceleb_trainer
  2. https://kaldi-asr.org/models/m7
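As a rough sketch, once an embedding extractor is in place (extract_embedding below is a placeholder for whichever pre-trained model you pick, e.g. from the repositories linked above; the chunk file names are made up), scoring each chunk against the master file reduces to a cosine similarity:

    import numpy as np

    def extract_embedding(wav_path):
        # Placeholder: replace with your i-vector / x-vector extractor,
        # e.g. a model obtained from one of the repositories linked above.
        raise NotImplementedError

    def cosine_similarity(a, b):
        # Cosine similarity between two fixed-dimensional embedding vectors
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    master_emb = extract_embedding("master.wav")
    scores = [cosine_similarity(master_emb, extract_embedding(f"chunk_{i:03d}.wav"))
              for i in range(207)]
    best_chunk = int(np.argmax(scores))  # index of the most similar chunk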

Upvotes: 1

Jon Nordby

Reputation: 6259

You can try to compute MFCC as features and use DTW as the distance metric.
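A minimal sketch with librosa (assuming the 207 chunk files exist on disk; the file names, n_mfcc and the "cosine" frame-level metric are arbitrary choices):

    import librosa
    import numpy as np

    def mfcc_dtw_cost(path_a, path_b, n_mfcc=20):
        # Compute MFCC feature matrices (coefficients x frames) for both files
        y_a, sr_a = librosa.load(path_a, sr=16000)
        y_b, sr_b = librosa.load(path_b, sr=16000)
        mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr_a, n_mfcc=n_mfcc)
        mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr_b, n_mfcc=n_mfcc)
        # DTW aligns the two sequences; the accumulated cost in the last cell
        # of the cost matrix acts as a dissimilarity score (lower = closer)
        D, _ = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="cosine")
        return D[-1, -1]

    costs = [mfcc_dtw_cost("master.wav", f"chunk_{i:03d}.wav") for i in range(207)]
    best_chunk = int(np.argmin(costs))  # chunk acoustically closest to the master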

Upvotes: 2

qmeeus

Reputation: 2402

This question would require a whole Speech Recognition 101 course to answer, but to keep it short:

  • The waveforms of the same word spoken by two different speakers will not be more similar to each other than to other random signals. You should rely on feature extraction to identify the formants, which lets you identify phonemes and then words (see here), or use a machine learning approach that does more or less the same thing for you (HMMs or neural networks).
  • A child and an adult will typically pronounce vowels using different frequency ranges. This is why you cannot use a simple distance metric to cluster words spoken by people from different genders or age groups.
  • If you still want to carry on with your approach despite the previous points, you have to decide what constitutes a good similarity measure in your case. You have many choices, for example MSE, MAE, MAPE, etc. (see the sketch below).
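For the last point, assuming you compare two feature matrices (or signals) of identical shape, the metrics mentioned are one-liners with numpy:

    import numpy as np

    def mse(a, b):
        return float(np.mean((a - b) ** 2))

    def mae(a, b):
        return float(np.mean(np.abs(a - b)))

    def mape(a, b, eps=1e-8):
        # eps guards against division by zero on silent (all-zero) frames
        return float(np.mean(np.abs((a - b) / (np.abs(a) + eps))) * 100)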

Upvotes: 0
