Reputation: 493
I am trying to split audio into segments with and without human voice. I've started using librosa's split method and it does a really good job. The only problem I am having is defining the best threshold for silence. This method has a top_db argument (in decibels) that treats everything below it as silence. Currently, I am using a hardcoded value of 40 dB. For some audio it works fine, but for others not so much.
Is there a way to discover the best top_db threshold for each audio signal? Maybe by considering the signal amplitude or the average dB. Or by normalizing the waveform amplitude before processing it, so that a given top_db performs well on most of the audio.
So far I have the following code:
import librosa
import numpy as np
from pydub import AudioSegment

def to_normalized_array(audio_chunk, fs, librosa_fs):
    # pydub gives 16-bit integer samples; scale to float32 in [-1, 1] and resample for librosa
    samples = audio_chunk.get_array_of_samples()
    arr = np.array(samples).astype(np.float32) / np.iinfo(np.int16).max
    return librosa.core.resample(arr, orig_sr=fs, target_sr=librosa_fs)

# Force 16-bit mono at 16 kHz before converting to an array
audio_chunk = AudioSegment.from_wav("audio.wav")
audio_chunk = audio_chunk.set_sample_width(2).set_channels(1).set_frame_rate(16000)

fs = 16000
librosa_fs = 22050
top_db = 40

arr = to_normalized_array(audio_chunk, fs, librosa_fs)
# split returns pairs of sample indices; dividing by the sample rate gives times in seconds
edges = librosa.effects.split(arr, top_db=top_db) / librosa_fs
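With those edges (start/end times in seconds), I then cut the original pydub segment, roughly like this (just a sketch of how I use the result; the segments name is only illustrative):

# edges holds (start, end) times in seconds; pydub slices in milliseconds
segments = [audio_chunk[int(start * 1000):int(end * 1000)] for start, end in edges]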
Thanks in advance,
Rhenan
Upvotes: 1
Views: 1697
Reputation: 4148
You can derive the threshold from the signal itself: compute the frame-wise RMS energy in dB, take a low percentile as the noise floor, and add a small margin on top of it:
import numpy as np
import librosa
from librosa import feature, core

# Frame-wise signal power, mirroring what librosa.effects.split computes internally
mse = feature.rms(y=arr, frame_length=2048, hop_length=512) ** 2
# dB relative to the loudest frame (np.max is also split's default reference)
mse_db = core.power_to_db(mse.squeeze(), ref=np.max, top_db=None)

percentile_parameter = 0.1  # [%] low percentile ~ noise floor
extra_db_parameter = 5      # [dB] margin above the noise floor

# Level in dB below the maximum under which frames are treated as silence
threshold = np.percentile(mse_db, percentile_parameter) + extra_db_parameter

# threshold is negative (dB below the max); top_db expects a positive distance below the reference
edges = librosa.effects.split(arr, top_db=-threshold) / librosa_fs
Tweak those two parameters (percentile_parameter and extra_db_parameter) to adjust to your case.
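If you process many files, you could also wrap the idea in a small helper, reusing the imports above (the adaptive_top_db name and defaults are just illustrative):

def adaptive_top_db(y, percentile=0.1, extra_db=5, frame_length=2048, hop_length=512):
    # Estimate a per-signal top_db: distance in dB from the loudest frame down to
    # slightly above the quietest frames
    mse = feature.rms(y=y, frame_length=frame_length, hop_length=hop_length) ** 2
    mse_db = core.power_to_db(mse.squeeze(), ref=np.max, top_db=None)
    return -(np.percentile(mse_db, percentile) + extra_db)

edges = librosa.effects.split(arr, top_db=adaptive_top_db(arr)) / librosa_fs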
Upvotes: 2