Reputation: 2811
I have been wondering why a “smart” phone needs elaborate manual steps to trigger SOS when it already has enough inputs to detect panic: mic, camera, GPS, gyroscope, etc.
I found this model (padmalcom/wav2vec2-large-nonverbalvocalization-classification) that promises to detect screams. When I run it on a test scream recording, I get a different result on every run.
Here is the script I’m using:
import torch
import librosa
from scipy.stats import zscore
from transformers import Wav2Vec2ForSequenceClassification

# Load the pretrained classifier
model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)

# Load the test clip and z-score normalize it
audio_path = "scream_test.wav"
audio, sample_rate = librosa.load(audio_path, sr=48000)
audio = zscore(audio)

# Run a single forward pass and map the top logit to its label
torch.manual_seed(42)
inputs = torch.tensor(audio).unsqueeze(0)
outputs = model(inputs)
predicted_class_index = torch.argmax(outputs.logits, dim=1).item()
labels = model.config.id2label
print(labels[predicted_class_index])
I see this warning before the output:
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at padmalcom/wav2vec2-large-nonverbalvocalization-classification and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
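From what I understand, that warning means the classification head gets fresh random weights every time the model is loaded, which would explain why the prediction changes between runs. Here is a quick check I put together to confirm that (just a sketch; it assumes the head is exposed as model.classifier, as the weight names in the warning suggest):

import torch
from transformers import Wav2Vec2ForSequenceClassification

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"

# Load the same checkpoint twice; if the head really is "newly initialized",
# the two copies should end up with different classifier weights.
m1 = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
m2 = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
print(torch.equal(m1.classifier.weight, m2.classifier.weight))  # False would mean a random head on each load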
What am I doing wrong?
I also tried the simpler pipeline approach:
from transformers import pipeline
model_name = 'padmalcom/wav2vec2-large-nonverbalvocalization-classification'
classifier = pipeline('audio-classification', model=model_name)
print(classifier("scream_test.wav"))
That also triggers the same warning and the same problem.
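Do I need to run the audio through the model's feature extractor and resample it instead of feeding z-scored 48 kHz audio directly? Something like the sketch below is what I had in mind (it assumes the repo ships a preprocessor config so AutoFeatureExtractor can load it, and that the model follows the usual wav2vec2 convention of 16 kHz input):

import torch
import librosa
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)

# Resample to whatever rate the feature extractor expects (16 kHz for most wav2vec2 models)
audio, _ = librosa.load("scream_test.wav", sr=feature_extractor.sampling_rate)
inputs = feature_extractor(audio, sampling_rate=feature_extractor.sampling_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(dim=-1).item()])

Would that fix the inconsistency, or is the problem only the uninitialized classifier head?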
Upvotes: 1
Views: 48