chaima rebah

Reputation: 31

How can I reconstruct an STFT back to audio?

In order to train an autoencoder model on audio data, as a first step I need to understand the different representations of audio found in the literature, such as the STFT (not the spectrogram; by STFT I mean the complex STFT coefficients), the spectrogram, MFCCs, etc. For this reason, I want to convert a .wav file into an STFT (not a spectrogram) or another representation, and then invert that representation to reconstruct a .wav file. I will listen to the reconstructed audio to assess whether it is too degraded, and to determine whether the chosen parameters of the representation are appropriate before using it with the model. I am currently attempting to implement the code, but I need some help with the reconstruction phase: how can I implement this in Python? Below is the code I have so far, without the reconstruction phase.


import os
import librosa
import librosa.display
import IPython.display as ipd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf 
scale_file = "/content/0-m-21-0-1-105.wav"
y,sr = librosa.load(scale_file, sr=16000)
###### waveform #######
plt.figure(figsize=(12,5))
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform")
plt.plot(y)
plt.show()
##### STFT #######
n_fft=1024 # window_length
hop_length=512
window_type ='hann'
sample_rate = 16000
# calculate duration hop length and window in seconds
hop_length_duration = float(hop_length)/sample_rate
n_fft_duration = float(n_fft)/sample_rate
print(f"STFT hop length duration is: {hop_length_duration}s")
print(f"STFT window duration is: {n_fft_duration}s")

stft_lib = librosa.stft(y, n_fft=n_fft,
                        hop_length=hop_length,
                        win_length=n_fft,
                        window=window_type)

### spectrogram ####
spectrogram = np.abs(stft_lib)
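# NOTE: np.abs() discards the phase, so a magnitude spectrogram alone
# cannot be inverted exactly; either keep the complex STFT for
# librosa.istft or estimate the phase (e.g. with Griffin-Lim)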
#X_inv = librosa.griffinlim(np.abs(stft_lib))
log_spectrogram = librosa.amplitude_to_db(spectrogram)
librosa.display.specshow(log_spectrogram, sr=sample_rate,hop_length=hop_length)
plt.xlabel("Time")
plt.ylabel("Frequency")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram (dB)")
plt.savefig("spectogram_log.png")
plt.show()

##### mel spectrogram ####

mel_spect = librosa.feature.melspectrogram(y=y, 
                                           sr = sr, 
                                           n_fft=n_fft,
                                           n_mels = 20,
                                           hop_length=hop_length,
                                           window='hann')
# melspectrogram returns a power spectrogram (power=2.0 by default),
# so convert to dB with power_to_db rather than amplitude_to_db
log_mel_spect = librosa.power_to_db(mel_spect)
librosa.display.specshow(log_mel_spect, sr=sample_rate, x_axis='time',
                         y_axis='mel', hop_length=hop_length)

plt.colorbar(format="%+2.0f dB")
plt.title("Log Mel spectrogram")
plt.tight_layout()
plt.savefig("Log Mel spectrogram.png")
plt.show()


I am attempting to train an autoencoder on audio data for a steganography application, and I am currently tuning the parameters of the data representation before feeding it into the model. To do this, I convert the audio into a representation, reconstruct it, and listen to the result. This approach helps me avoid the impact of choosing an inappropriate representation on the training outcome.

Upvotes: 1

Views: 540

Answers (1)

chaima rebah

Reputation: 31

If someone wants to do the same (.wav -> mel spectrogram, then reconstruct it later), you can find the explanation below:

import librosa

my_sample_rate = 16000
# step 1 - load the wav file as a numpy array (resampled to 16 kHz;
# without sr=..., librosa.load would resample to its 22050 Hz default)
my_audio_as_np_array, my_sample_rate = librosa.load("path_file", sr=my_sample_rate)

# step 2 - convert the audio array to a mel spectrogram
# (shape: n_mels x n_frames)
spec = librosa.feature.melspectrogram(y=my_audio_as_np_array,
                                      sr=my_sample_rate,
                                      n_fft=1024,
                                      hop_length=512,
                                      win_length=None,
                                      window='hann',
                                      center=True,
                                      pad_mode='reflect',
                                      power=2.0)
                                      # n_mels=128 by default

# step 3 - convert the mel spectrogram back to a waveform
# (mel_to_audio inverts the mel filterbank to an STFT magnitude, then
#  estimates the phase with n_iter iterations of Griffin-Lim)
res = librosa.feature.inverse.mel_to_audio(spec,
                                           sr=my_sample_rate,
                                           n_fft=1024,
                                           hop_length=512,
                                           win_length=None,
                                           window='hann',
                                           center=True,
                                           pad_mode='reflect',
                                           power=2.0,
                                           n_iter=32)
                                           # n_mels=128 by default

# step 4 - save the reconstruction as a wav file
import soundfile as sf
sf.write("test2.wav", res, my_sample_rate)

Alternatively, you can invert the mel spectrogram to an STFT magnitude yourself and run the Griffin-Lim algorithm to get back a waveform:

# invert the mel filterbank to an approximate STFT magnitude
# (n_fft must match the value used to build the mel spectrogram)
res = librosa.feature.inverse.mel_to_stft(spec,
                                          sr=my_sample_rate,
                                          n_fft=1024)

# estimate the phase and reconstruct the waveform with Griffin-Lim
y = librosa.griffinlim(res,
                       n_fft=1024,
                       hop_length=512,
                       win_length=None,
                       window='hann',
                       center=True,
                       pad_mode='reflect')
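
For the plain STFT asked about in the question, no phase estimation is needed at all: as long as you keep the complex STFT coefficients (i.e. before taking np.abs), librosa.istft inverts them almost perfectly. A minimal round-trip sketch, assuming the same parameters as in the question (the output filename is arbitrary):

import librosa
import soundfile as sf

y, sr = librosa.load("/content/0-m-21-0-1-105.wav", sr=16000)

# complex STFT: keeps both magnitude and phase
stft = librosa.stft(y, n_fft=1024, hop_length=512,
                    win_length=1024, window='hann')

# inverse STFT: near-perfect reconstruction because the phase was kept;
# length=len(y) trims the padding so the output matches the input length
y_rec = librosa.istft(stft, hop_length=512, win_length=1024,
                      window='hann', length=len(y))

sf.write("reconstructed_stft.wav", y_rec, sr)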

Upvotes: 1
