Reputation: 501
Currently I am working on a project that requires me to pick out audio clips and compare them based on their FFT results (i.e. spectrogram). All of my audio clips are 0.200 s long, but when I process them through the transform, the results are no longer all the same length. The code I am using for the transform uses the numpy and librosa libraries:
import numpy as np
import librosa as lb

def extractFFT(audioArr):
    fourierArr = []
    for x in range(len(audioArr)):
        y, sr = lb.load(audioArr[x])
        fourier = np.fft.fft(y)
        fourier = fourier.real  # keep only the real part
        fourierArr.append(fourier)
    return fourierArr
I am only taking the real part of the transform because I also want to pass this through a PCA, which does not allow complex numbers. Regardless, I can perform neither LDA (linear discriminant analysis) nor PCA on this FFT array of audio clips, since some elements are of different lengths.
The code I have for the LDA is as follows, where the labels are given for a frequencyArr of length 4:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def LDA(frequencyArr):
    splitMark = int(len(frequencyArr) * 0.8)
    trainingData = frequencyArr[:splitMark]
    validationData = frequencyArr[splitMark:]
    labels = [1, 1, 2, 2]
    lda = LinearDiscriminantAnalysis()
    lda.fit(trainingData, labels[:splitMark])
    print(f"prediction: {lda.predict(validationData)}")
This throws the following ValueError, coming from the lda.fit(trainingData, labels[:splitMark]) line:
ValueError: setting an array element with a sequence.
I know this error stems from the array not having a consistent 2-dimensional shape, since I don't receive this error when the FFT elements are all of equal length and the code works as intended.
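For reference, here is a minimal reproduction of the same error with made-up lengths (the exact message varies slightly between numpy versions):

import numpy as np

# two "FFT results" of different lengths, standing in for my real data
ragged = [np.zeros(4410), np.zeros(4409)]

# forcing this into a float array fails, which is roughly what lda.fit() does internally
np.asarray(ragged, dtype=float)  # ValueError: setting an array element with a sequence.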
Does this have something to do with the audio clips? After the transform, some audio clips are of equal lengths, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!
Note, they normally only differ by a few points: say, for 3 of the audio clips the FFT length is 4410, but for the 4th it is 4409. I know I can probably just trim the lengths down to the smallest length in the group, but I'd prefer a cleaner method that won't leave out any values.
Upvotes: 1
Views: 1059
Reputation: 5310
First of all: Do not only take the real part of the transform result. It won't do you any good. Use the power (r^2 + i^2) or magnitude (sqrt(power)) to get the strength of the signal for a frequency bin.
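For illustration, a minimal numpy sketch of both quantities (y here is just a stand-in for one loaded clip):

import numpy as np

y = np.random.randn(4410)     # stand-in for one loaded audio clip
spectrum = np.fft.fft(y)      # complex-valued FFT
magnitude = np.abs(spectrum)  # sqrt(real^2 + imag^2), signal strength per bin
power = magnitude ** 2        # real^2 + imag^2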
Does this have something to do with the audio clips? After the transform, some audio clips are of equal lengths, others are not. If someone could explain why these same-length audio clips can return different-length FFTs, that would be great!
They are simply not the same length. I bet the number of samples in your clips isn't exactly identical.
After y, sr = lb.load(audioArr[x]), do print('sample count = {}'.format(len(y))) and you will most likely see different values (you've stated as much yourself).
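For instance, a quick check over all clips (assuming audioArr holds your file paths, as in your question):

import librosa as lb

for path in audioArr:
    y, sr = lb.load(path)
    print('{}: sample count = {}'.format(path, len(y)))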
As you already point out, you could of course simply cut off the signal at min(len(y)) and then feed it into the FFT. But typically, what you do to get around this is to use a discrete STFT, which has a fixed window size. This ensures same-length input to the FFT. You can use librosa's implementation as an easy starting point. The docs also explain how to get magnitude/power.
So instead of:
y, sr = lb.load(audioArr[x])
fourier = np.fft.fft(y)
fourier = fourier.real
fourierArr.append(fourier)
You do:
y, sr = lb.load(audioArr[x])
# get the magnitudes
D = np.abs(lb.stft(y, n_fft=4096))  # use 4096 as the window length
fourierArr.append(D[:, 0])          # only use the first frame of the STFT
In essence, if you use the Fourier transform with different-length input, you will get different-length output, which is something that LDA does not forgive when using this output as training data. So you have to make sure your input has the same length. The easiest way to do this is to use the STFT (or simply cut all your input to the minimum length). IMO, there is nothing unclean about this, and it will not affect results much if you are missing only a couple of samples.
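If you do prefer trimming, a minimal sketch of that alternative (fourierArr as in your extractFFT function):

import numpy as np

# cut every FFT result down to the shortest one, then stack into a 2-D array
minLen = min(len(f) for f in fourierArr)
frequencyArr = np.array([f[:minLen] for f in fourierArr])  # valid input for LDA/PCA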
Upvotes: 3