kacpo1

Reputation: 565

Spectrogram in python using numpy

I need to make a spectrogram using numpy. I take 1 s of audio, split it into 0.02 s chunks, calculate the FFT of each chunk with numpy, and put the results back together into one image. The results are poor.

Here is spectrogram generated using matplotlib specgram function:

[image: spectrogram from matplotlib specgram]

And here is my 'spectrogram':

[image: spectrogram from my code]

Here is my code:

spect_frags = []
transform = []

for x in range(0, 8000, 160):
  spect_frags.append(spect_sample[x:x + 160])

for sample in spect_frags:
  transform.append(abs(np.fft.fft(sample).real)[0:np.fft.fft(sample).real.size//4])

I cut out 3/4 of the frequencies because I don't need them for now. I have no clue why there is so much difference in resolution. How can I improve it?

Upvotes: 2

Views: 14244

Answers (2)

Kavorite

Reputation: 41

For efficient use with JAX, I found it useful to adapt @jlandercy's solution to avoid explicit loops and to add some simple Hann windowing. I gave up support for stereo input in the process, although I'm fairly certain that, because the power spectrum is a nonlinear function of the signal, I'm destroying some information by collapsing stereo signals to mono before carrying this out.
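As an aside, here is a minimal sketch of one way to keep both channels, assuming input shaped (n_samples, 2) and the stft function defined below; jax.vmap and the stft_stereo name are my own additions, not part of the original answer:

import jax

def stft_stereo(stereo, n_fft=128):
  # Map the mono stft over the channel axis: (n_samples, 2) -> (2, frames, bins)
  return jax.vmap(lambda ch: stft(ch, n_fft))(stereo.T)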

I also did not extend their answer to include overlapping windows, in the interest of preserving its simplicity and concision. There's probably some numpy sorceress out there who knows how to efficiently stitch together or interpolate slices of non-overlapping signal samples from the innermost dimensions of wins or some such. Regrettably, I myself have no such knowledge of the dark arts, so that functionality has been elided (though a rough sketch of one possibility appears after the code below).

import matplotlib.pyplot as plt
import jax.numpy as jnp

def stft(a, n_fft=128, window=jnp.hanning):
  n = n_fft
  # Right-pad so the length is a multiple of n (0 when it already is).
  rpad = -a.shape[-1] % n
  # Split into non-overlapping frames and apply the window.
  wins = jnp.pad(a, (0, rpad)).reshape(-1, n) * window(n)
  # FFT each frame; shift only the frequency axis (axes=-1, otherwise
  # fftshift also rolls the time axis), then keep the non-negative half.
  fftc = jnp.fft.fftshift(jnp.fft.fft(wins, n=n), axes=-1)[..., n // 2 : n]
  # Power spectrum: x * conj(x) has a null imaginary part.
  fftr = jnp.real(fftc * jnp.conj(fftc))
  return fftr

audio =  # buffer single-channel floating-point samples from somewhere...
_ = plt.imshow(stft(audio, 1024).T[:, -512:], cmap="viridis")
[image: demo spectrogram]
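For what it's worth, here is a rough sketch of how overlapping frames could be gathered without an explicit loop, using plain index arithmetic; the hop parameter and the function name are my own assumptions, not part of the answer above:

def frame_overlapping(a, n_fft=128, hop=64):
  # Frame start offsets (rows) plus in-frame offsets (columns) give a
  # 2-D index matrix; fancy indexing then gathers the frames.
  n_frames = 1 + (a.shape[-1] - n_fft) // hop
  idx = hop * jnp.arange(n_frames)[:, None] + jnp.arange(n_fft)[None, :]
  return a[idx]  # shape (n_frames, n_fft)

These frames could stand in for the non-overlapping wins inside stft above, before windowing.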

Upvotes: 2

jlandercy

Reputation: 11072

Spectrogram MCVE

You can recreate a crude estimate of specgram with the following code:

import numpy as np
from scipy.io import wavfile
from scipy import fft
import matplotlib.pyplot as plt

# Read some sample file (replace with your data):
rate, data = wavfile.read('./data/aaaah.wav')
# rate=48000, data.shape=(46447, 2) ~ almost 1s of stereo signal

# Spectrogram estimation:
N = 256
S = []
for k in range(0, data.shape[0]+1, N):
    x = fft.fftshift(fft.fft(data[k:k+N,0], n=N))[N//2:N]
    # assert np.allclose(np.imag(x*np.conj(x)), 0)
    Pxx = 10*np.log10(np.real(x*np.conj(x)))
    S.append(Pxx)
S = np.array(S)

# Frequencies:
f = fft.fftshift(fft.fftfreq(N, d=1/rate))[N//2:N]
# array([    0. ,   187.5,   375. , ..., 23625. , 23812.5])

# Spectrogram rendering:
plt.imshow(S.T, origin='lower')

It outputs:

[image: MCVE spectrogram]

When specgram renders:

_ = plt.specgram(data[:,0])

[image: specgram output]

This MCVE differs from specgram in that the axes would need to be scaled to properly reflect time and frequency, and there is no moving window. More precisely:

  • x-axis stands for the time chunk index, each chunk of length N=256;
  • y-axis is the positive half-plane FFT index (N//2=128); notice the use of fftshift to assemble the spectrum after fft;
  • real frequencies are available using the sampling rate and fftfreq; in specgram they range from 0 to 1 as this method is not necessarily aware of the signal sampling rate (a sketch of the axis scaling follows this list);
  • there is no window overlap (independent contiguous chunks are used), which is why the MCVE is a bit less smooth than specgram.
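As a minimal sketch of that axis scaling (reusing S, f, N and rate from the MCVE above; the labels are my own choice):

# Map chunk indices to seconds and bin indices to Hz via `extent`:
duration = S.shape[0] * N / rate   # seconds covered by all chunks
plt.imshow(S.T, origin='lower', aspect='auto',
           extent=[0, duration, f[0], f[-1]])
plt.xlabel('Time [s]')
plt.ylabel('Frequency [Hz]')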

Power estimate

Also notice that taking the real part of a complex number is not the same as taking its magnitude. Specifically, when you write:

abs(np.fft.fft(sample).real)

You are not taking the norm of the complex number; you discard the imaginary part entirely because of the .real call.

You should estimate the power using product of conjugates:

10*np.log10(np.real(x*np.conj(x)))

Then take the real part (the imaginary part must be null), or use abs, to convert the complex type into a float. Finally, you can scale to decibels using the decimal logarithm.
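A minimal demonstration of the difference, using an arbitrary test signal of my own choosing:

import numpy as np

t = np.linspace(0, 1, 256, endpoint=False)
x = np.fft.fft(np.sin(2 * np.pi * 5 * t))

wrong = np.abs(x.real)               # discards the imaginary part first
power = np.real(x * np.conj(x))      # |x|**2, the proper power estimate
magnitude = np.sqrt(power)           # equals np.abs(x)

print(np.allclose(wrong, magnitude))       # False: the two differ
print(np.allclose(magnitude, np.abs(x)))   # True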

Sanity check

You can check that the result of the FFT is indeed of complex type with a significant imaginary part (removing it leads to information loss):

x
# array([-1.56000000e+02-0.00000000e+00j, -3.94271344e+01+1.17935735e+02j,
#         4.03754070e+01+4.14695163e+01j,  1.71510716e+01+1.26920718e+01j,
#         2.15523795e+01-2.07362424e+00j, -3.03847433e+00-1.22767815e+01j,
#        -4.56347533e+00-7.36380957e-01j, -1.28048283e+01-6.80931256e+00j,
#        -2.22781473e+01+1.12096897e+01j, -1.13788549e+01+2.54314337e+01j,
#        ...])

And the product of conjugates does have a null imaginary part (but is still of complex type):

x*np.conj(x)
# array([2.43360000e+04+0.j, 1.54633365e+04+0.j, 3.34989427e+03+0.j,
#        4.55247945e+02+0.j, 4.68804979e+02+0.j, 1.59951690e+02+0.j,
#        2.13675640e+01+0.j, 2.10330365e+02+0.j, 6.21972990e+02+0.j,
#        7.76236159e+02+0.j, 1.05846430e+03+0.j, 6.54663598e+02+0.j,
#        6.95792718e+01+0.j, 6.03013130e+01+0.j, 1.11620428e+01+0.j,
#        ...])

You can ensure this is always true (sanity check) by asserting the following:

assert np.allclose(np.imag(x*np.conj(x)), 0)

Upvotes: 5
