Sarangan

Reputation: 967

Clustering algorithm for Voice clustering

What is the best clustering methodology to use in the voice domain?

For example, suppose we have voice utterances from multiple speakers and need to cluster them into specific baskets, where each basket corresponds to one speaker. What is the best clustering algorithm for this?

Upvotes: 3

Views: 1442

Answers (2)

hithisispeter

Reputation: 399

There are two approaches here: supervised classification as Eduardo suggests, or unsupervised clustering. Supervised requires training data (audio clips labeled with who is speaking) while unsupervised does not (although you do need some labeled examples to evaluate the method). Here I'll discuss unsupervised clustering.

The biggest difference is that an unsupervised model that works for this task can be applied to audio clips from new speakers, and to any number of speakers. Supervised models only work on the specific speakers, and the number of speakers, they were trained on, which is a huge limitation.

  • The most important element is a way to encode each audio clip into a fixed-length vector such that the encoding captures the information you need, namely who is speaking. If you transcribed the clips into text, this could be TF*IDF or BERT, which would pick out differences in topic, speech style, etc., but that would perform poorly if clips from different speakers come from the same conversation. There is probably a pretrained encoder for voice clips that would work well here; I'm not as familiar with these. (A minimal encoding sketch follows this list.)
  • Clustering method: Simple k-means may work here, where k is the number of people in the dataset, if known. If it isn't known, you can use clustering metrics such as inertia and silhouette with the elbow heuristic to pick the optimal k, which may match the number of speakers if your encoding is good enough (see the k-means sketch below). Alternatively, a hierarchical method like agglomerative clustering can help if there is some inherent hierarchy in the clips, e.g. half of the people talk only about science while the other half talk only about literature, or a first split by gender or age.
  • Evaluation: Use PCA to project each fixed-length encoding onto 2D so you can visualize it, assigning each cluster's clips a unique color (see the PCA sketch below). This shows which clusters are more similar to each other, and the organization of the clusters hints at which features the encodings actually represent.
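
As a rough sketch of the encoding step (one assumption among many, not the only option): if librosa is available, mean- and std-pooled MFCCs give a crude fixed-length vector per clip. A pretrained speaker encoder would likely work much better.

```python
import numpy as np
import librosa  # assumption: librosa is installed for audio loading and MFCCs

def encode_clip(path, n_mfcc=20):
    """Crude fixed-length encoding: mean and std of each MFCC over time."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # shape (2 * n_mfcc,)

clip_paths = [...]  # hypothetical: paths to your audio clips
embeddings = np.stack([encode_clip(p) for p in clip_paths])
```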
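
A sketch of the clustering step with scikit-learn, continuing from the embeddings array above; best_k is a placeholder for whatever value the metric sweep suggests:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sweep k when the number of speakers is unknown; look for the inertia
# "elbow" and the silhouette peak.
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    print(k, km.inertia_, silhouette_score(embeddings, km.labels_))

best_k = 4  # hypothetical: read off the sweep above
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(embeddings)
```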
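
And the visualization step, projecting the same embeddings to 2D and coloring by cluster:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the embeddings to 2D and color each point by its cluster assignment.
coords = PCA(n_components=2).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.title("Clip encodings in 2D (PCA), colored by cluster")
plt.show()
```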

Pros and cons of unsupervised:

Pros:

  • Flexible to the number of unique speakers and their voices. If you successfully build a clusterer that groups audio clips by speaker, you can take that model and apply it to a totally different set of clips from different people, even a different number of people, and it will likely work similarly. A classifier would need to be trained on the exact people, and the same number of people, it is applied to; otherwise it will not work.
  • No need for a large labeled dataset; you only need enough labeled examples to verify that the method works. You can even do this after the fact by listening to samples from one cluster and checking whether they sound like a single person.

Cons:

  • It may not work. You have little control over which features are represented in the embedding, and those features determine cluster assignment; your only lever is the choice of embedding method. An embedding could be as simple as the average volume of the clip, but what tends to work better is taking the front half of a supervised model that someone else has trained on a voice task, effectively using one of its hidden states as your embedding (see the sketch after this list). If that task is similar to yours, such as a speaker-identification classifier, it will probably work well.
  • Hard to evaluate objectively unless you have a labeled test set.
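
To make the "front half of a supervised model" idea concrete, here is a minimal PyTorch sketch; SpeakerClassifier and its layer sizes are hypothetical, not from any specific library:

```python
import torch
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """Hypothetical classifier: an encoder followed by a classification head."""
    def __init__(self, n_features, n_speakers):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.head = nn.Linear(64, n_speakers)

    def forward(self, x):
        return self.head(self.encoder(x))

model = SpeakerClassifier(n_features=40, n_speakers=12)  # hypothetical sizes
# ... train `model` on the labeled half ...

# After training, discard the head and use the encoder output as the embedding.
# `features` is a hypothetical (n_clips, n_features) float array.
model.eval()
with torch.no_grad():
    embeddings = model.encoder(torch.as_tensor(features, dtype=torch.float32)).numpy()
```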

My suggestion: If you have a labeled set of voices, use half of it to train a classifier as Eduardo suggests, use that model's hidden states as your embedding method, feed those embeddings to k-means, and use the other half of the labeled examples as a test set (a scoring sketch follows).
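
For scoring the held-out half, a sketch assuming you have the true speaker labels and the k-means assignments for those clips; the adjusted Rand index is permutation-invariant, so it doesn't matter that cluster IDs are arbitrary:

```python
from sklearn.metrics import adjusted_rand_score

# `true_speakers_test` and `cluster_labels_test` are hypothetical arrays
# of speaker IDs and cluster assignments for the held-out half.
score = adjusted_rand_score(true_speakers_test, cluster_labels_test)
print(f"Adjusted Rand index: {score:.3f}")  # 1.0 means clusters match speakers exactly
```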

Upvotes: 3

Eduardo Gomes

Reputation: 567

I'd suggest an RNN-LSTM. There is a great tutorial series on music genre classification using this kind of network. I've watched it, and it's very easy to follow:

  1. First you have to understand your audio data (take a look here). In that link he explains MFCCs (Mel Frequency Cepstral Coefficients), which let you extract features from your audio data into a spectrogram-like representation. Each MFCC amplitude represents a feature of the audio (e.g. characteristics of the speaker's voice).
  2. Then you have to preprocess the data for the classification (practical example here)
  3. And then train your neural network to predict which speaker the audio belongs to. He shows this here, but I'd recommend watching the entire series. I think it's the best I've seen on this topic, giving all the background, code, and dataset needed to solve this kind of speaker classification problem (a minimal sketch follows this list).
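
A minimal Keras sketch of what step 3 might look like (layer sizes are illustrative, not taken from the tutorial), assuming each example is an MFCC sequence of shape (time_steps, 13):

```python
import tensorflow as tf

n_speakers = 10  # hypothetical: number of speakers in the training set

# LSTM classifier over variable-length MFCC sequences.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 13)),               # (time_steps, n_mfcc)
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(n_speakers, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # integer speaker IDs as labels
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=30, batch_size=32)
```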

Hope you enjoy the links; they really helped me, and I'm sure they will answer your question.

Upvotes: 3
