Reputation: 493
I am trying to split audio into segments with and without human voice. I've started using librosa's split method and it does a really good job. The only problem I am having is defining the best threshold for silence. This method has a top_db argument (in decibels) that treats everything below it as silence. Currently, I am using a hardcoded value of 40 dB. For some audio it works fine, but for others not so much.
Is there a way to discover the best top_db threshold for each audio signal? Maybe by considering the signal amplitude or the average dB. Or by normalizing the waveform amplitude before processing it, so that a given top_db performs well on most of the audio.
So far I have the following code:
import librosa
import numpy as np
from pydub import AudioSegment

def to_normalized_array(audio_chunk, fs, librosa_fs):
    # pydub gives 16-bit integer samples; scale to float32 in [-1, 1] and resample for librosa
    samples = audio_chunk.get_array_of_samples()
    arr = np.array(samples).astype(np.float32) / np.iinfo(np.int16).max
    return librosa.core.resample(arr, orig_sr=fs, target_sr=librosa_fs)

# Force 16-bit mono at 16 kHz before converting to an array
audio_chunk = AudioSegment.from_wav("audio.wav")
audio_chunk = audio_chunk.set_sample_width(2).set_channels(1).set_frame_rate(16000)

fs = 16000
librosa_fs = 22050
top_db = 40

arr = to_normalized_array(audio_chunk, fs, librosa_fs)
# split returns pairs of sample indices; dividing by the sample rate gives times in seconds
edges = librosa.effects.split(arr, top_db=top_db) / librosa_fs
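With those edges (start/end times in seconds), I then cut the original pydub segment, roughly like this (just a sketch of how I use the result; the segments name is only illustrative):

# edges holds (start, end) times in seconds; pydub slices in milliseconds
segments = [audio_chunk[int(start * 1000):int(end * 1000)] for start, end in edges]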
Thanks in advance,
Rhenan
Upvotes: 1
Views: 1697
Reputation: 4148
You can derive the threshold from the signal itself: compute the frame-wise RMS energy in dB, take a low percentile as the noise floor, and add a small margin on top of it:
import numpy as np
import librosa
from librosa import feature, core

# Frame-wise signal power, mirroring what librosa.effects.split computes internally
mse = feature.rms(y=arr, frame_length=2048, hop_length=512) ** 2
# dB relative to the loudest frame (np.max is also split's default reference)
mse_db = core.power_to_db(mse.squeeze(), ref=np.max, top_db=None)

percentile_parameter = 0.1  # [%] low percentile ~ noise floor
extra_db_parameter = 5      # [dB] margin above the noise floor

# Level in dB below the maximum under which frames are treated as silence
threshold = np.percentile(mse_db, percentile_parameter) + extra_db_parameter

# threshold is negative (dB below the max); top_db expects a positive distance below the reference
edges = librosa.effects.split(arr, top_db=-threshold) / librosa_fs
Tweak those two parameters (percentile_parameter and extra_db_parameter) to adjust to your case.
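If you process many files, you could also wrap the idea in a small helper, reusing the imports above (the adaptive_top_db name and defaults are just illustrative):

def adaptive_top_db(y, percentile=0.1, extra_db=5, frame_length=2048, hop_length=512):
    # Estimate a per-signal top_db: distance in dB from the loudest frame down to
    # slightly above the quietest frames
    mse = feature.rms(y=y, frame_length=frame_length, hop_length=hop_length) ** 2
    mse_db = core.power_to_db(mse.squeeze(), ref=np.max, top_db=None)
    return -(np.percentile(mse_db, percentile) + extra_db)

edges = librosa.effects.split(arr, top_db=adaptive_top_db(arr)) / librosa_fs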
Upvotes: 2