Reputation: 81
I have a master.wav file of 14 seconds and a child.wav file of 221 seconds, which is divided into a total of 207 chunks of 14 seconds each. I want to compare each child chunk with the master file and measure the similarity between them. My hypothesis is that the child chunk with the highest similarity will contain exactly, or nearly, the same words as those spoken in the master file. I am using the pyAudioAnalysis library to extract features from the .wav files (https://github.com/tyiannak/pyAudioAnalysis).
Upvotes: 1
Views: 2920
Reputation: 360
You can extract an embedding vector from each chunk and compute the cosine similarity between them (or use another similarity or distance measure if you prefer). An embedding vector is a fixed-dimensional vector that summarizes the global information (e.g., speaker identity) in a given speech sample; because its size is fixed, it lets you compare speech samples of different durations. The embedding vectors can be extracted using encoder modules trained for distribution representation or speaker recognition. Here are some popular embedding methods:
These methods are useful for comparing speech samples that contain different sentences (or words), as they are usually optimized for text-independent speaker verification. If you don't have large-scale data available for training such models, luckily there are some pre-trained models publicly available:
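Whichever encoder is used, the comparison step itself is short. Here is a minimal sketch of it, assuming each embedding is already available as a plain list of floats; the names `cosine_similarity` and `best_match` and the toy vectors are illustrative only and do not come from any library:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def best_match(master_vec, chunk_vecs):
    """Return (index, similarity) of the chunk most similar to the master."""
    scores = [cosine_similarity(master_vec, v) for v in chunk_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy stand-ins for real embeddings (hypothetical values).
master = [1.0, 0.0, 1.0]
chunks = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
index, score = best_match(master, chunks)
print(index, round(score, 3))
```

In practice `master` would be the embedding of master.wav and `chunks` the embeddings of the 207 child chunks, all produced by the same encoder.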
Upvotes: 1
Reputation: 6259
You can try computing MFCCs as features and using dynamic time warping (DTW) as the distance measure.
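A minimal sketch of the DTW step, assuming the MFCCs have already been extracted (for example with pyAudioAnalysis) into one list of per-frame feature vectors per file. The helper names `dtw_distance` and `closest_chunk`, the Euclidean frame cost, and the one-dimensional toy frames are assumptions made for illustration:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat frame of seq_b
                                 cost[i][j - 1],      # repeat frame of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def closest_chunk(master_frames, chunk_frames_list):
    """Return (index, distance) of the chunk with the smallest DTW cost."""
    distances = [dtw_distance(master_frames, c) for c in chunk_frames_list]
    best = min(range(len(distances)), key=distances.__getitem__)
    return best, distances[best]

# Toy 1-D "frames" standing in for MFCC vectors (hypothetical values).
master = [[0.0], [1.0], [2.0], [1.0]]
chunks = [
    [[5.0], [5.0], [5.0]],
    [[0.0], [0.0], [1.0], [2.0], [2.0], [1.0]],  # time-stretched copy of master
    [[2.0], [1.0], [0.0]],
]
index, distance = closest_chunk(master, chunks)
print(index, distance)
```

The time-stretched chunk aligns with zero cost, which is the property that makes DTW attractive when the same words are spoken at a different pace. This pure-Python version is quadratic in the number of frames, so for 14-second chunks an optimized DTW implementation would likely be preferable.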
Upvotes: 2
Reputation: 2402
This question would require a whole Speech Recognition 101 course to answer, but to keep it short:
Upvotes: 0