Digil

Reputation: 69

Segmentation instead of diarization for speaker count estimation

I'm using pyannote's diarization pipeline to determine the number of speakers in an audio file, where the number of speakers cannot be known in advance. Here is the code that determines the speaker count via diarization:

from pyannote.audio import Pipeline
MY_TOKEN = ""  # huggingface_auth_token
audio_file = "my_audio.wav"
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1", use_auth_token=MY_TOKEN)
output = pipeline(audio_file, min_speakers=2, max_speakers=10)
results = []
for turn, _, speaker in output.itertracks(yield_label=True):
    results.append(speaker)
num_speakers = len(set(results))
print(num_speakers)
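
(As a side note, if I understand the pyannote API correctly, the unique labels can also be read directly from the returned annotation with output.labels(), so num_speakers = len(output.labels()) would avoid the manual loop.)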

Using full diarization just for speaker count estimation seems like overkill and is slow. So I tried to segment the audio into chunks, embed the segments, and cluster the embeddings, taking the ideal number of clusters as the likely number of speakers. Under the hood, pyannote is probably doing something similar to estimate the number of speakers. Here is what I tried in code:

from sklearn.cluster import SpectralClustering, KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from spectralcluster import SpectralClusterer
from resemblyzer import VoiceEncoder, preprocess_wav
from pyannote.audio.pipelines.speaker_verification import PretrainedSpeakerEmbedding
from pyannote.audio import Model
from pyannote.audio import Audio
from pyannote.core import Segment
from pyannote.audio.pipelines import VoiceActivityDetection
import numpy as np


audio_file = "my_audio.wav"
MY_TOKEN = ""  # huggingface_token
embedding_model = PretrainedSpeakerEmbedding("speechbrain/spkrec-ecapa-voxceleb")
encoder = VoiceEncoder()
model = Model.from_pretrained("pyannote/segmentation", 
                              use_auth_token=MY_TOKEN)
pipeline = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
  # onset/offset activation thresholds
  "onset": 0.5, "offset": 0.5,
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline(audio_file)
audio_model = Audio()

segments = list(vad.itertracks(yield_label=True))
embeddings = np.zeros(shape=(len(segments), 192))  # 192-dim speechbrain ECAPA embeddings
# embeddings = np.zeros(shape=(len(segments), 256))  # 256-dim if using resemblyzer VoiceEncoder instead

for i, diaz in enumerate(segments):
    print(i, diaz)
    # diaz is a (segment, track, label) tuple; diaz[0] is the speech segment
    waveform, sample_rate = audio_model.crop(audio_file, diaz[0])
    embed = embedding_model(waveform[None])
    # alternative: resemblyzer embeddings
    #wav = preprocess_wav(waveform[None].flatten().numpy())
    #embed = encoder.embed_utterance(wav)
    embeddings[i] = embed
embeddings = np.nan_to_num(embeddings)

max_clusters = 10
silhouette_scores = []
# clustering = SpectralClusterer(min_clusters=2, max_clusters=max_clusters, custom_dist="cosine")
# labels = clustering.predict(embeddings)
# print(labels)

for n_clusters in range(2, max_clusters+1):
    # clustering = SpectralClustering(n_clusters=n_clusters, affinity='nearest_neighbors').fit(embeddings)
    # clustering = KMeans(n_clusters=n_clusters).fit(embeddings)
    clustering = AgglomerativeClustering(n_clusters=n_clusters).fit(embeddings)
    labels = clustering.labels_
    score = silhouette_score(embeddings, labels)
    print(n_clusters, score)
    silhouette_scores.append(score)

# Choose the number of clusters that maximizes the silhouette score
number_of_speakers = np.argmax(silhouette_scores) + 2  # add 2 to account for starting at n_clusters=2
print(number_of_speakers)

The problem is that I'm not getting the same results as pyannote's diarization, especially when the number of speakers is greater than 2; the diarization pipeline seems to return a more realistic count. How can I get the same results as pyannote diarization, but using a faster process such as segmentation?

Upvotes: 1

Views: 2185

Answers (1)

App Rank Portal

Reputation: 11

It is not surprising that the two methods are giving different results. Speaker diarization and speaker clustering are two different approaches to the same problem of speaker counting, and they make different assumptions about the data and the problem.

Speaker diarization relies on techniques like speaker change detection and speaker embedding to segment the audio into regions that correspond to different speakers, and then assigns each segment to a unique speaker label. This approach is robust to various sources of variation in the audio, such as overlapping speech, background noise, and speaker characteristics, but it can be computationally expensive.

Speaker clustering, on the other hand, assumes that the audio can be divided into a fixed number of non-overlapping segments, and attempts to group them into clusters that correspond to different speakers based on some similarity metric. This approach is faster than diarization but may not be as accurate, especially when the number of speakers is not known a priori.

To improve the accuracy of your speaker clustering approach, you may want to consider incorporating some of the techniques used in diarization, such as voice activity detection and speaker embedding. For example, you could use a VAD algorithm to segment the audio into regions of speech and non-speech, and then apply clustering to the speech regions only. You could also use a pre-trained speaker embedding model to extract features from the speech regions and use them as input to your clustering algorithm.
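
As a rough sketch of that clustering step (not the exact logic pyannote uses), one option is to length-normalise the embeddings so that Euclidean distance behaves like cosine distance, and let agglomerative clustering derive the number of clusters from a distance threshold instead of a fixed n_clusters. The estimate_num_speakers helper and the threshold of 1.0 below are illustrative assumptions, not values taken from pyannote:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def estimate_num_speakers(embeddings, distance_threshold=1.0):
    # L2-normalise so Euclidean distance between embeddings is a
    # monotonic function of cosine distance.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normed = embeddings / np.maximum(norms, 1e-8)
    # With n_clusters=None, the distance threshold decides how many
    # clusters (speakers) are formed, so the count need not be fixed a priori.
    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        linkage="average",
    ).fit(normed)
    return clustering.n_clusters_

# e.g. num_speakers = estimate_num_speakers(embeddings)

On unit-normalised vectors a Euclidean threshold of 1.0 corresponds to a cosine distance of 0.5; the right value depends on the embedding model and the data, so it would need tuning.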

Overall, it is unlikely that you will be able to achieve the same level of accuracy as diarization using clustering alone, but you may be able to get close by combining the two approaches.

Upvotes: 1
