Digil

Reputation: 69

Segmentation instead of diarization for speaker count estimation

I'm using pyannote's diarization pipeline to determine the number of speakers in an audio file, where the number of speakers cannot be known in advance. Here is the code that determines the speaker count via diarization:

from pyannote.audio import Pipeline
MY_TOKEN = ""  # huggingface_auth_token
audio_file = "my_audio.wav"
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token=MY_TOKEN)
output = pipeline(audio_file, min_speakers=2, max_speakers=10)
results = []
for turn, _, speaker in output.itertracks(yield_label=True):
    results.append(speaker)
num_speakers = len(set(results))
print(num_speakers)
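
(As a side note, if I understand the pyannote API correctly, the unique labels can also be read directly from the returned annotation with output.labels(), so num_speakers = len(output.labels()) would avoid the manual loop.)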

Using full diarization just for speaker count estimation seems like overkill and is slow. So I tried to segment the audio into chunks, embed the segments, and cluster the embeddings, taking the ideal number of clusters as the likely number of speakers. Under the hood, pyannote is probably doing something similar to estimate the number of speakers. Here is what I tried in code:

from sklearn.cluster import SpectralClustering, KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from spectralcluster import SpectralClusterer
from resemblyzer import VoiceEncoder, preprocess_wav
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Model
from pyannote.audio import Audio
from pyannote.core import Segment
from pyannote.audio.pipelines import VoiceActivityDetection
import numpy as np


audio_file = "my_audio.wav"
MY_TOKEN = ""  # huggingface_token
embedding_model = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb")
encoder = VoiceEncoder()
model = Model.from_pretrained("pyannote/segmentation", 
                              use_auth_token=MY_TOKEN)
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
  # onset/offset activation thresholds
  "onset": 0.5, "offset": 0.5,
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline(audio_file)
audio_model = Audio()

segments = list(vad.itertracks(yield_label=True))
embeddings = np.zeros(shape=(len(segments), 192))  # 192-dim speechbrain ECAPA embeddings
# embeddings = np.zeros(shape=(len(segments), 256))  # 256-dim if using resemblyzer VoiceEncoder instead

for i, diaz in enumerate(segments):
    print(i, diaz)
    # diaz is a (segment, track, label) tuple; diaz[0] is the speech segment
    waveform, sample_rate = audio_model.crop(audio_file, diaz[0])
    embed = embedding_model(waveform[None])
    # alternative: resemblyzer embeddings
    #wav = preprocess_wav(waveform[None].flatten().numpy())
    #embed = encoder.embed_utterance(wav)
    embeddings[i] = embed
embeddings = np.nan_to_num(embeddings)

max_clusters = 10
silhouette_scores = []
# clustering = SpectralClusterer(min_clusters=2, max_clusters=max_clusters, custom_dist="cosine")
# labels = clustering.predict(embeddings)
# print(labels)

for n_clusters in range(2, max_clusters+1):
    # clustering = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors').fit(embeddings)
    # clustering = KMeans(n_clusters=n_clusters).fit(embeddings)
    clustering = AgglomerativeClustering(n_clusters=n_clusters).fit(embeddings)
    labels = clustering.labels_
    score = silhouette_score(embeddings, labels)
    print(n_clusters, score)
    silhouette_scores.append(score)

# Choose the number of clusters that maximizes the silhouette score
number_of_speakers = np.argmax(silhouette_scores) + 2  # add 2 to account for starting at n_clusters=2
print(number_of_speakers)

The problem is that I'm not getting the same results as pyannote's diarization, especially when the number of speakers is greater than 2; the diarization pipeline seems to return a more realistic count. How can I get the same results as pyannote diarization, but using a faster process such as segmentation?

Upvotes: 1

Views: 2185

Answers (1)

App Rank Portal

Reputation: 11

It is not surprising that the two methods are giving different results. Speaker diarization and speaker clustering are two different approaches to the same problem of speaker counting, and they make different assumptions about the data and the problem.

Speaker diarization relies on techniques like speaker change detection and speaker embedding to segment the audio into regions that correspond to different speakers, and then assigns each segment to a unique speaker label. This approach is robust to various sources of variation in the audio, such as overlapping speech, background noise, and speaker characteristics, but it can be computationally expensive.

Speaker clustering, on the other hand, assumes that the audio can be divided into a fixed number of non-overlapping segments, and attempts to group them into clusters that correspond to different speakers based on some similarity metric. This approach is faster than diarization but may not be as accurate, especially when the number of speakers is not known a priori.

To improve the accuracy of your speaker clustering approach, you may want to consider incorporating some of the techniques used in diarization, such as voice activity detection and speaker embedding. For example, you could use a VAD algorithm to segment the audio into regions of speech and non-speech, and then apply clustering to the speech regions only. You could also use a pre-trained speaker embedding model to extract features from the speech regions and use them as input to your clustering algorithm.
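
As a rough sketch of that clustering step (not the exact logic pyannote uses), one option is to length-normalise the embeddings so that Euclidean distance behaves like cosine distance, and let agglomerative clustering derive the number of clusters from a distance threshold instead of a fixed n_clusters. The estimate_num_speakers helper and the threshold of 1.0 below are illustrative assumptions, not values taken from pyannote:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def estimate_num_speakers(embeddings, distance_threshold=1.0):
    # L2-normalise so Euclidean distance between embeddings is a
    # monotonic function of cosine distance.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normed = embeddings / np.maximum(norms, 1e-8)
    # With n_clusters=None, the distance threshold decides how many
    # clusters (speakers) are formed, so the count need not be fixed a priori.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        linkage="average",
    ).fit(normed)
    return clustering.n_clusters_

# e.g. num_speakers = estimate_num_speakers(embeddings)

On unit-normalised vectors a Euclidean threshold of 1.0 corresponds to a cosine distance of 0.5; the right value depends on the embedding model and the data, so it would need tuning.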

Overall, it is unlikely that you will be able to achieve the same level of accuracy as diarization using clustering alone, but you may be able to get close by combining the two approaches.

Upvotes: 1
