Reputation: 311605

What data structures are used to encode a trained voice model?

What does a trained voice model look like? That is:

What are the typical data structures which encode a useful fingerprint of someone's voice?
How is a voice sample compared to the model for evaluation to decide whether it's a match or not?

I understand there's probably some variety in the implementations, so any popular example from either academic literature or a successful implementation would be great.

Upvotes: 1

Answers (2)

Nikolay Shmyrev

Reputation: 25220

What are the typical data structures which encode a useful fingerprint of someone's voice?

The modern approach is based on factor vectors called i-vectors. An i-vector is a real-valued vector of 100-400 elements. It characterizes a speaker pretty well.

You can learn more about i-vectors from the tutorial.

Originally, i-vectors were extracted with GMMs; state-of-the-art systems use DNN-based detectors instead.

How is a voice sample compared to the model for evaluation to decide whether it's a match or not?

I-vectors are compared using the cosine distance between them: the smaller the distance, the more likely the two vectors come from the same speaker.
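As a rough illustration of cosine scoring, here is a minimal sketch in plain Python. It assumes the i-vectors have already been extracted and are given as lists of floats. The function names and the threshold value of 0.5 are hypothetical and not taken from any particular toolkit; in practice the threshold would be tuned on held-out data.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def is_same_speaker(enrolled, test, threshold=0.5):
    """Accept the trial if the similarity reaches the threshold.

    The default threshold is an illustrative assumption, not a recommended value.
    """
    return cosine_similarity(enrolled, test) >= threshold
```

Cosine distance is simply 1 minus this similarity, so thresholding one is equivalent to thresholding the other.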

I understand there's probably some variety in the implementations, so any popular example from either academic literature or a successful implementation would be great.

There are a number of implementations; you can get the best results from Kaldi.

Upvotes: 2

Rob

Reputation: 1131

To create a person model:

Typically, in voice biometrics you have a long recording of someone's voice.

Then you split the recording into short segments, each lasting some milliseconds, and extract features from these segments. The most widespread features are the Mel Frequency Cepstral Coefficients (MFCCs):

https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
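The splitting step can be sketched as follows. This is only an illustration: the 25 ms frame length and 10 ms hop are common choices assumed here, not values given in this answer, and the function name is hypothetical. The MFCC computation itself is not shown; each frame returned here would be passed to an MFCC extractor.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a sequence of audio samples into overlapping fixed-length frames.

    frame_ms and hop_ms are assumed defaults; trailing samples that do not
    fill a whole frame are dropped.
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop must each span at least one sample")

    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

With a 16 kHz recording, these defaults give frames of 400 samples that start every 160 samples, so consecutive frames overlap.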

Once you have a dataset (the MFCCs of many short segments of voice), you can model the voice by estimating the probability density of the MFCCs with a model such as a Gaussian Mixture Model (GMM):

https://en.wikipedia.org/wiki/Mixture_model#Gaussian_mixture_model

To predict:

Imagine that you have now several voice models for several people.

When you have a new voice recording, you need to split it into short segments again and extract the MFCCs.

Then you can obtain the probability that the new samples belong to each one of your models.

If the probability for a model is higher than a threshold, you have a match.
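To make the data structure and the matching step concrete, here is a minimal sketch in plain Python. It assumes each speaker model is a GMM with diagonal covariances that has already been trained (training is not shown), and that the score is the average log-likelihood per frame. The class name, function names, and threshold are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class DiagonalGMM:
    """A trained speaker model: K Gaussian components over D-dimensional features."""
    weights: List[float]          # K positive mixing weights that sum to 1
    means: List[List[float]]      # K mean vectors of length D
    variances: List[List[float]]  # K per-dimension variance vectors of length D


def frame_log_likelihood(gmm, frame):
    """Log density of one feature vector under the mixture (log-sum-exp for stability)."""
    component_logs = []
    for weight, mean, var in zip(gmm.weights, gmm.means, gmm.variances):
        log_p = math.log(weight)
        for x, m, v in zip(frame, mean, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def average_log_likelihood(gmm, frames):
    """Average per-frame log-likelihood, so recordings of different lengths are comparable."""
    if not frames:
        raise ValueError("at least one frame is required")
    return sum(frame_log_likelihood(gmm, f) for f in frames) / len(frames)


def best_match(models, frames, threshold):
    """Return the name of the best-scoring model, or None if it does not exceed the threshold."""
    scores = {name: average_log_likelihood(gmm, frames) for name, gmm in models.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None
```

Here `models` is a dictionary mapping a speaker name to that speaker's `DiagonalGMM`, and `frames` is the list of MFCC vectors extracted from the new recording. The threshold has to be chosen empirically for the data at hand.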

Upvotes: 2
