koko

Reputation: 23

Binary classification of audio .wav files

Hey, I'm a total layman when it comes to audio processing, so my question will be very basic. I have audio samples from two groups, X and Y, as .wav files, and I need to build a model that correctly classifies whether a sound belongs to X or Y. I found out how to load the data into a list, then converted it to a DataFrame with 2 columns (each row of the second column holds 8000 elements):

       0    1
0   2000    [0.1329449, 0.14544961, 0.19810106, 0.21718721...
1   2000    [-0.30273795, -0.6065889, -0.4967722, -0.47117...
2   2000    [-0.07037315, -0.6685449, -0.48479277, -0.4535...
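
For reference, this is roughly how I build it, assuming the first column holds the sample rate (the folder name samples is just a placeholder):

    import os
    import pandas as pd
    import scipy.io.wavfile as sw

    rows = []
    for fname in os.listdir('samples'):                  # placeholder folder holding the .wav files
        rate, signal = sw.read(os.path.join('samples', fname))
        rows.append([rate, signal])                      # column 0: sample rate, column 1: the 8000 samples

    df = pd.DataFrame(rows)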

These are the features from the python_speech_features module that I have found useful so far:

    import pandas as pd
    import python_speech_features as psf
    import scipy.io.wavfile as sw

    rate, signal = sw.read(i)                                  # i is the path to one .wav file
    mfcc = psf.base.mfcc(signal, samplerate=rate)              # mel-frequency cepstral coefficients
    fbank, energy = psf.base.fbank(signal, samplerate=rate)    # mel filterbank energies
    logfbank = psf.base.logfbank(signal, samplerate=rate)      # log filterbank energies
    mfcc = psf.base.lifter(mfcc, L=22)                         # cepstral liftering of the MFCCs
    delta = psf.base.delta(mfcc, N=13)                         # delta (trajectory) features
    features = pd.DataFrame(mfcc)                              # frames x coefficients for this file
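That snippet runs once per file, so i is the path to a single .wav sample, e.g. from a loop like this (again with a placeholder folder name):

    import glob
    import python_speech_features as psf
    import scipy.io.wavfile as sw

    all_mfcc = []
    for i in glob.glob('samples/*.wav'):                 # placeholder folder with the .wav files
        rate, signal = sw.read(i)
        all_mfcc.append(psf.base.mfcc(signal, samplerate=rate))
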
  1. What other kinds of features should I extract from the audio files?
  2. What is worth visualizing here to unveil some patterns? E.g., can I visualize some feature that shows the difference between X and Y?
  3. What is the best way to do this classification: is a neural network better, or will traditional models suffice?

I will appreciate any kind of help. Additional resources for self-learning are highly welcome as well.

Upvotes: 1

Views: 1714

Answers (1)

pietz

Reputation: 2533

I've had great success in converting audio files to melspectrograms and using a basic CNN to classify the images. The following function requires the librosa library:

    import librosa as lr

    def audio_to_image(path, height=192, width=192):
        # load and resample the audio (librosa defaults to 22050 Hz)
        signal, sr = lr.load(path, res_type='kaiser_fast')
        # pick a hop length so the spectrogram comes out ~10% wider than needed
        hl = signal.shape[0] // (width * 1.1)
        # mel spectrogram with one mel band per pixel row
        spec = lr.feature.melspectrogram(y=signal, sr=sr, n_mels=height, hop_length=int(hl))
        # log-scale the power, roughly matching human loudness perception
        img = lr.power_to_db(spec) ** 2
        # center-crop to the target width, trimming ~5% of frames from each side
        start = (img.shape[1] - width) // 2
        return img[:, start:start + width]
  1. Load the audio file
  2. Choose the hop length so the spectrogram comes out about 10% wider than the target width
  3. Create a melspectrogram from the audio signal
  4. Log-scale the amplitude, similar to human hearing
  5. Cut away 5% from the beginning and the end to handle silence
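
To build a training set, you can run the function over all files of both groups and stack the results. A minimal sketch, where the folder names group_x and group_y are placeholders for your two classes:

    import glob
    import numpy as np

    X, y = [], []
    for label, folder in enumerate(['group_x', 'group_y']):   # placeholder folders for the two classes
        for path in glob.glob(folder + '/*.wav'):
            X.append(audio_to_image(path))
            y.append(label)

    X = np.array(X)[..., np.newaxis]   # shape: (n_samples, 192, 192, 1), one channel
    y = np.array(y)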

A single spectrogram will look something like this:

[image: melspectrogram]

While there is little human intuition behind these images, CNNs can classify them fairly well. Play a little with different resolutions and settings.
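
For the network itself, something small is enough to get started. A minimal sketch, assuming Keras and the X/y arrays from the snippet above; the layer sizes are just starting values to tune:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(192, 192, 1)),
        layers.Conv2D(16, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(32, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation='relu'),
        layers.Dense(1, activation='sigmoid'),   # binary output: group X vs. group Y
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(X, y, epochs=10, validation_split=0.2)

Let me know how this works for you.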

EDIT: Here is the full code of my own project, which classifies audio samples of speech by their spoken language.

Upvotes: 2
