Reputation: 1708
I currently have a few thousand audio clips that I need to classify with machine learning.
After some digging I found that if you take a short-time Fourier transform (STFT) of the audio, it turns into a two-dimensional image, so I can use various image classification algorithms on these images instead of on the audio files themselves.
To this end I found a Python package that does the STFT; all I need is to plot the result so I can get the images. For plotting I found this github repo very useful.
Finally my code ended up as this:
    import stft
    import scipy
    import scipy.io.wavfile as wav
    import matplotlib.pylab as pylab

    def save_stft_image(source_filename, destination_filename):
        fs, audio = wav.read(source_filename)
        X = stft.spectrogram(audio)
        print(X.shape)
        fig = pylab.figure()
        ax = pylab.Axes(fig, [0, 0, 1, 1])
        ax.set_axis_off()
        fig.add_axes(ax)
        pylab.imshow(scipy.absolute(X[:][:][0].T), origin='lower',
                     aspect='auto', interpolation='nearest')
        pylab.savefig(destination_filename)

    save_stft_image("Example.wav", "Example.png")
The code works; however, I observed that when the print(X.shape) line executes I get (513L, 943L, 2L), so the result is three-dimensional. When I write only X[:][:][0] or X[:][:][1] I get an image.
I keep reading about this "redundancy" the STFT has: that you can discard half of the output because you would not need it. Is that third dimension the redundancy, or am I doing something very wrong here? If so, how do I plot it properly?
Thank you.
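For reference, the "redundancy" usually refers to the conjugate symmetry of the Fourier transform of a real-valued signal: the upper half of each frame's spectrum mirrors the lower half and carries no extra information. A quick NumPy check (independent of the stft package) illustrates this, and also shows where a 513-row spectrogram comes from for a 1024-sample window:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(1024)  # a real-valued frame
N = len(x)
spectrum = np.fft.fft(x)

# For a real input, bin k and bin N-k are complex conjugates,
# so the top half of the spectrum is redundant.
assert np.allclose(spectrum[1:N // 2], np.conj(spectrum[-1:N // 2:-1]))

# That is why np.fft.rfft keeps only N//2 + 1 bins;
# for N = 1024 that is 513, matching the 513 rows in the (513, 943, 2) array.
half = np.fft.rfft(x)
assert half.shape[0] == N // 2 + 1
```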
Edit: So the new code and output are:

    import stft
    import os
    import scipy
    import scipy.io.wavfile as wav
    import matplotlib.pylab as pylab

    def save_stft_image(source_filename, destination_filename):
        fs, audio = wav.read(source_filename)
        audio = scipy.mean(audio, axis=1)  # mix the stereo channels down to mono
        X = stft.spectrogram(audio)
        print(X.shape)
        fig = pylab.figure()
        ax = pylab.Axes(fig, [0, 0, 1, 1])
        ax.set_axis_off()
        fig.add_axes(ax)
        pylab.imshow(scipy.absolute(X.T), origin='lower',
                     aspect='auto', interpolation='nearest')
        pylab.savefig(destination_filename)

    save_stft_image("Example.wav", "Example.png")
On the left I get an almost invisible column of colors. The sounds I am working with are respiratory sounds, so they have very low frequencies. Maybe that is why the visualization is a very thin column of colors.
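One common reason a spectrogram looks mostly blank is that a linear magnitude scale hides low-level content; plotting in decibels usually helps. Below is a minimal sketch of the dB conversion, using scipy.signal.stft and a synthetic 100 Hz tone in place of your files and the third-party stft package, purely so it is self-contained (the sample rate and frequency are made up for illustration):

```python
import numpy as np
from scipy import signal

fs = 8000                              # assumed sample rate
t = np.arange(fs) / fs                 # one second of samples
audio = np.sin(2 * np.pi * 100 * t)    # stand-in for a low-frequency sound

# One-sided STFT: 512-sample windows give 512 // 2 + 1 = 257 frequency rows.
f, times, Zxx = signal.stft(audio, fs=fs, nperseg=512)

# Convert linear magnitude to decibels; the small offset avoids log(0).
magnitude_db = 20 * np.log10(np.abs(Zxx) + 1e-10)
```

Passing magnitude_db (instead of the raw absolute values) to pylab.imshow, optionally with vmin/vmax to clamp the dynamic range, should make the low-frequency band visible across the whole time axis.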
Upvotes: 1
Views: 3602
Reputation: 3930
You probably have a stereo audio file, so the two slices along the last axis, X[:, :, 0] and X[:, :, 1], correspond to the two channels. (Note that X[:][:][0] does not actually select a channel; it is the same as X[0]. Use X[:, :, 0] instead.)
You can convert multichannel audio to mono with scipy.mean(audio, axis=1).
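To illustrate the channel averaging, here is a small self-contained sketch with a fake stereo buffer. (scipy.mean was only an alias for numpy.mean, and recent SciPy versions have dropped such aliases, so np.mean is used here.)

```python
import numpy as np

# A fake stereo buffer: one sample per row, one column per channel,
# the same layout scipy.io.wavfile.read returns for a 2-channel file.
stereo = np.column_stack([np.ones(1000), np.zeros(1000)])
assert stereo.shape == (1000, 2)

# Average the channels sample by sample to get a mono signal.
mono = np.mean(stereo, axis=1)
assert mono.shape == (1000,)
assert np.allclose(mono, 0.5)  # mean of the 1.0 and 0.0 channels
```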
Upvotes: 1