Rhenan Bartels

Reputation: 493

Find the best decibel threshold to split an audio into segments with and without human voice in Python

I am trying to split audio into segments with and without human voice. I've started using the split method from librosa, and it does a really good job. The only problem I am having is choosing the best threshold for silence.

This method has an argument top_db (in decibels) that considers everything below it as silence. Currently, I am using a hardcoded value of 40 dB. For some audio, it works fine but for others not so much.
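
To make the meaning of top_db concrete, here is a minimal NumPy-only sketch. It assumes the default behavior in which the reference is the loudest frame, so a frame counts as non-silent when its level is less than top_db decibels below that peak. The helper name frame_db_below_peak and the non-overlapping framing are simplifications for illustration, not librosa's actual implementation.

```python
import numpy as np


def frame_db_below_peak(y, frame_length=2048):
    """Return each frame's RMS level in dB relative to the loudest frame (<= 0)."""
    n_frames = len(y) // frame_length
    frames = y[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    rms = np.maximum(rms, 1e-10)  # avoid log of zero on digital silence
    return 20.0 * np.log10(rms / rms.max())


# Synthetic example: 1 s of faint noise followed by 1 s of a loud tone.
sr = 16000
rng = np.random.default_rng(0)
quiet = 0.001 * rng.standard_normal(sr)
loud = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
y = np.concatenate([quiet, loud])

db = frame_db_below_peak(y)
top_db = 40
non_silent = db > -top_db  # frames less than top_db dB below the peak
```

Because the levels are relative to the peak, scaling the whole waveform by a constant leaves this mask unchanged.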

Is there a way to find the best top_db threshold for each audio signal, perhaps based on the signal amplitude or the average dB? Alternatively, could I normalize the waveform amplitude before processing it, so that a single top_db value works well for most audio?

So far I have the following code:

import librosa
import numpy as np

from pydub import AudioSegment


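def to_normalized_array(audio_chunk, fs, librosa_fs):
    # Convert 16-bit PCM samples to float32 in roughly [-1, 1]
    samples = audio_chunk.get_array_of_samples()
    arr = np.array(samples).astype(np.float32) / np.iinfo(np.int16).max
    # Recent librosa versions require the sample rates as keyword arguments
    return librosa.resample(arr, orig_sr=fs, target_sr=librosa_fs)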


audio_chunk = AudioSegment.from_wav("audio.wav")
audio_chunk = audio_chunk.set_sample_width(2).set_channels(1).set_frame_rate(16000)

fs = 16000
librosa_fs = 22050
top_db = 40

arr = to_normalized_array(audio_chunk, fs, librosa_fs)
edges = librosa.effects.split(arr, top_db=top_db) / librosa_fs

Thanks in advance,

Rhenan

Upvotes: 1

Views: 1697

Answers (1)

dankal444

Reputation: 4148

  1. Calculate frame energies the same way librosa does (based on its _signal_to_frame_nonsilent and split functions):
import numpy as np
from librosa import feature
from librosa import core
ref = np.max  # default reference used by librosa.effects.split
mse = feature.rms(y=arr, frame_length=2048, hop_length=512) ** 2
mse_db = core.power_to_db(mse.squeeze(), ref=ref, top_db=None)  # values are <= 0 dB
  2. Instead of the average dB, take a low percentile, e.g. the 10th, assuming silence takes up at least 15% of the audio. Add a small margin to it to account for variance in the noise level:
percentile_parameter = 10      # [%]
extra_db_parameter = 5         # [dB]
threshold = np.percentile(mse_db, percentile_parameter) + extra_db_parameter
  3. Provide this value as top_db. Because mse_db is relative to the maximum, threshold is negative, while top_db is the positive number of decibels below the reference, so negate it:
edges = librosa.effects.split(arr, top_db=-threshold) / librosa_fs

Tweak those two parameters (percentile_parameter and extra_db_parameter) to adjust to your case.
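
The steps above can also be sketched end to end with NumPy alone, which makes the sign convention easier to see. This is only a rough illustration: estimate_top_db is a hypothetical helper, it does not pad or center frames the way librosa does by default, and it assumes the signal is at least frame_length samples long. It returns a positive number, since top_db counts decibels below the loudest frame.

```python
import numpy as np


def estimate_top_db(y, frame_length=2048, hop_length=512,
                    percentile=10.0, extra_db=5.0):
    """Estimate a positive top_db from the quietest frames of y.

    Hypothetical helper: frame energies are expressed in dB relative to the
    loudest frame (values <= 0), the noise floor is taken as a low percentile,
    and the result is negated because top_db is measured below the peak.
    """
    n_frames = 1 + (len(y) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    idx = starts[:, None] + np.arange(frame_length)[None, :]
    mse = np.mean(y[idx] ** 2, axis=1)   # mean squared energy per frame
    mse = np.maximum(mse, 1e-10)         # avoid log of zero
    mse_db = 10.0 * np.log10(mse / mse.max())
    floor_db = np.percentile(mse_db, percentile)
    return -(floor_db + extra_db)


# Synthetic check: 1 s of faint noise followed by 1 s of a loud tone.
sr = 16000
rng = np.random.default_rng(0)
quiet = 0.001 * rng.standard_normal(sr)
loud = 0.5 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
signal = np.concatenate([quiet, loud])

adaptive_top_db = estimate_top_db(signal)
```

In this synthetic case the noise sits roughly 51 dB below the tone, so the estimate lands a few decibels above that floor. The result would then be passed as librosa.effects.split(arr, top_db=adaptive_top_db).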

Upvotes: 2
