Reputation: 29711
I am performing voice activity detection on a recorded audio file to detect speech vs. non-speech portions in the waveform.
The output of the classifier looks like this (highlighted green regions indicate speech):
The only issue I face here is making it work for a stream of audio input (e.g. from a microphone) and doing real-time analysis for a stipulated time frame.
I know PyAudio can be used to record speech from the microphone dynamically, and there are a couple of real-time visualization examples for waveforms, spectra, spectrograms, etc., but I could not find anything relevant to carrying out feature extraction in a near real-time manner.
Upvotes: 19
Views: 33151
Reputation: 1
I recently found this question while looking for an answer to the same problem; thanks for all the suggestions. I found three more detectors. Picovoice is by far better than webrtc; SpeechBrain and NVIDIA don't support real-time, which is a shame:
- picovoice Cobra: https://picovoice.ai/docs/cobra/
- SpeechBrain: https://speechbrain.readthedocs.io/en/latest/API/speechbrain.pretrained.interfaces.html#speechbrain.pretrained.interfaces.VAD
- NVIDIA NeMo: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speech_classification/models.html
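For what it's worth, here is a minimal sketch of Cobra's Python usage based on the Picovoice docs (you need your own AccessKey, and get_next_audio_frame() is just a placeholder for your own microphone capture, so treat this as an outline rather than working code):

import pvcobra

# Sketch only: requires a Picovoice AccessKey. get_next_audio_frame() is a
# placeholder for however you read 16-bit PCM frames of cobra.frame_length
# samples at cobra.sample_rate from your microphone.
cobra = pvcobra.create(access_key='YOUR_ACCESS_KEY')

def get_next_audio_frame():
    raise NotImplementedError  # supply cobra.frame_length int16 samples here

while True:
    voice_probability = cobra.process(get_next_audio_frame())
    print('Voice probability: %.2f' % voice_probability)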
Upvotes: -1
Reputation: 13636
You should try using the Python bindings to Google's WebRTC VAD. It's lightweight, fast, and provides very reasonable results based on GMM modelling. Since the decision is provided per frame, the latency is minimal.
import webrtcvad

# Run the VAD on 10 ms of silence. The result should be False.
vad = webrtcvad.Vad(2)
sample_rate = 16000
frame_duration = 10  # ms
frame = b'\x00\x00' * int(sample_rate * frame_duration / 1000)  # 16-bit mono PCM silence
print('Contains speech: %s' % vad.is_speech(frame, sample_rate))
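For a live microphone, a rough sketch of feeding PyAudio frames into the same VAD might look like this (the 16 kHz rate, 30 ms frame size and aggressiveness level 2 are my own choices; webrtcvad only accepts 16-bit mono PCM at 8/16/32/48 kHz in 10/20/30 ms frames):

import pyaudio
import webrtcvad

vad = webrtcvad.Vad(2)
sample_rate = 16000
frame_duration_ms = 30
samples_per_frame = int(sample_rate * frame_duration_ms / 1000)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=sample_rate,
                input=True, frames_per_buffer=samples_per_frame)

try:
    while True:
        # One 30 ms frame of 16-bit mono PCM straight from the microphone.
        frame = stream.read(samples_per_frame)
        print('Contains speech: %s' % vad.is_speech(frame, sample_rate))
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    p.terminate()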
Also, this article might be useful for you.
UPDATE December 2022
As the topic still draws attention, I'd like to update my answer. SileroVAD is a very fast and very accurate VAD that was recently released under the MIT license.
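As a hedged example, offline usage looks roughly like this (interface as documented in the snakers4/silero-vad README at the time of writing; the entry point and helper names may have changed since):

import torch

# Minimal sketch based on the snakers4/silero-vad README.
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

wav = read_audio('recording.wav', sampling_rate=16000)
# Returns a list of {'start': ..., 'end': ...} dicts, in samples.
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(speech_timestamps)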
Upvotes: 26
Reputation: 3728
I found that LibROSA could be one of the solutions to your problem. There's a simple tutorial on Medium on using microphone streaming to realise real-time prediction.
Using the Short-Time Fourier Transform (STFT) as the feature extractor, the author explains:
To calculate the STFT, a Fast Fourier Transform window size (n_fft) of 512 is used. According to the equation n_stft = n_fft/2 + 1, 257 frequency bins (n_stft) are calculated over a window size of 512. The window is moved by a hop length of 256 to have a better overlapping of the windows in calculating the STFT.
stft = np.abs(librosa.stft(trimmed, n_fft=512, hop_length=256, win_length=512))
import matplotlib.pyplot as plt

# Plot audio with zoomed-in y axis
def plotAudio(output):
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 10))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    ax.margins(2, -0.1)
    plt.show()

# Plot audio
def plotAudio2(output):
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(20, 4))
    plt.plot(output, color='blue')
    ax.set_xlim((0, len(output)))
    plt.show()
import numpy as np
import librosa

def minMaxNormalize(arr):
    mn = np.min(arr)
    mx = np.max(arr)
    return (arr - mn) / (mx - mn)

def predictSound(X):
    # Empirically select top_db for every sample
    clip, index = librosa.effects.trim(X, top_db=20, frame_length=512, hop_length=64)
    stfts = np.abs(librosa.stft(clip, n_fft=512, hop_length=256, win_length=512))
    stfts = np.mean(stfts, axis=1)
    stfts = minMaxNormalize(stfts)
    # `model` and `lb` (the label binarizer) are defined earlier in the full notebook
    result = model.predict(np.array([stfts]))
    predictions = [np.argmax(y) for y in result]
    print(lb.inverse_transform([predictions[0]])[0])
    plotAudio2(clip)
import pyaudio
import numpy as np
import noisereduce as nr

CHUNKSIZE = 22050  # fixed chunk size
RATE = 22050

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1,
                rate=RATE, input=True, frames_per_buffer=CHUNKSIZE)

# Preprocess the surrounding noise: grab a noise window first
data = stream.read(10000)
noise_sample = np.frombuffer(data, dtype=np.float32)
print("Noise Sample")
plotAudio2(noise_sample)
loud_threshold = np.mean(np.abs(noise_sample)) * 10
print("Loud threshold", loud_threshold)

audio_buffer = []
near = 0

while True:
    # Read chunk and load it into a numpy array.
    data = stream.read(CHUNKSIZE)
    current_window = np.frombuffer(data, dtype=np.float32)

    # Reduce noise in real time (noisereduce 1.x API)
    current_window = nr.reduce_noise(audio_clip=current_window, noise_clip=noise_sample, verbose=False)

    if len(audio_buffer) == 0:
        audio_buffer = current_window
    else:
        if np.mean(np.abs(current_window)) < loud_threshold:
            print("Inside silence region")
            if near < 10:
                audio_buffer = np.concatenate((audio_buffer, current_window))
                near += 1
            else:
                predictSound(np.array(audio_buffer))
                audio_buffer = []
                near = 0  # reset the silence counter
        else:
            print("Inside loud region")
            near = 0
            audio_buffer = np.concatenate((audio_buffer, current_window))

# close stream
stream.stop_stream()
stream.close()
p.terminate()
Code credit to: Chathuranga Siriwardhana
Full code can be found here.
Upvotes: 5
Reputation: 74
I think there are two approaches here: using an existing voice activity detector, or training a small model of your own.
The first one is fast and feasible and can be implemented and tested very quickly, while the second one is a bit more difficult to implement. I think you are already somewhat familiar with the second option.
In the case of the second approach, you will need a dataset of speech that is labeled frame by frame with a binary sequence like 00000000111111110000000011110000. The neural net should be small and optimized for running on edge devices like mobile.
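Just to illustrate the idea (the architecture and feature shape below are my own arbitrary choices, not a recipe), such a per-frame classifier could be as small as:

import tensorflow as tf

# Illustrative sketch only: a tiny per-frame classifier mapping one feature
# vector (e.g. 40 log-mel bands per frame) to a speech/non-speech probability,
# matching the 0/1 frame labels described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(40,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# X: (num_frames, 40) features, y: (num_frames,) 0/1 labels from your dataset.
# model.fit(X, y, epochs=10, batch_size=256)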
You can check this out from TensorFlow; it's a voice activity detector, and I think it suits your purpose.
Also, check these out:
https://github.com/eesungkim/Voice_Activity_Detector
https://github.com/pyannote/pyannote-audio
Of course, you should compare the performance of the mentioned toolkits and models, and the feasibility of implementing them on mobile devices.
Upvotes: 3
Reputation: 2860
Audio usually has a low bitrate, so I don't see any problem with writing your code completely in numpy and Python. If you need low-level array access, consider numba. Also, profile your code, e.g. with line_profiler. And note that there is scipy.signal for more advanced signal processing.
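For instance, a hedged example of cleaning up a chunk with scipy.signal before making the decision (the band edges and filter order are arbitrary choices of mine):

import numpy as np
from scipy import signal

# Example only: a 4th-order Butterworth band-pass over the typical speech band
# (300-3400 Hz) at a 16 kHz sample rate.
fs = 16000
sos = signal.butter(4, [300, 3400], btype='bandpass', fs=fs, output='sos')

chunk = np.random.randn(fs)           # stand-in for one second of audio samples
filtered = signal.sosfilt(sos, chunk)  # filtered chunk to feed your detector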
Usually audio processing works in samples. So you define a sample size for your process, and then run a method to decide if that sample contains speech or not.
import numpy as np

def main_loop():
    stream = <create stream with your audio library>
    while True:
        sample = stream.readframes(<define number of samples / time to read>)
        print(is_speech(sample))

def is_speech(sample):
    audio = np.array(sample)
    <do your processing>
    # e.g. simple loudness test
    return np.any(audio > 0.8)
That should get you pretty far.
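To make the sketch above concrete, here is a hedged version using PyAudio (mentioned in the question); the chunk size, sample rate and the 0.1 loudness threshold are arbitrary choices of mine, not tuned values:

import numpy as np
import pyaudio

RATE = 16000
CHUNK = 1024  # samples per read; arbitrary choice

def is_speech(sample_bytes):
    # Simple loudness test, as in the sketch above; 0.1 is an arbitrary threshold.
    audio = np.frombuffer(sample_bytes, dtype=np.float32)
    return np.any(np.abs(audio) > 0.1)

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paFloat32, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)
try:
    while True:
        print(is_speech(stream.read(CHUNK)))
except KeyboardInterrupt:
    stream.stop_stream()
    stream.close()
    p.terminate()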
Upvotes: 3