Reputation: 21
I am trying to update the feature extraction pipeline of a speech command recognition model by replacing the function audio_ops.audio_spectrogram() with tf.contrib.signal.stft(). I assumed the two were equivalent, but I am obtaining different spectrogram values for the same input audio. Could someone explain the relation between the two methods, and whether it is possible to obtain the same results using tf.contrib.signal.stft()?
My code:
1) The audio_ops method:
from tensorflow.contrib.framework.python.ops import audio_ops
import tensorflow as tf
import numpy as np
from tensorflow.python.ops import io_ops
# WAV audio loader
wav_filename_placeholder_ = tf.placeholder(tf.string, [], name='wav_filename')
wav_loader = io_ops.read_file(wav_filename_placeholder_)
sample_rate = 16000
desired_samples = 16000  # 1 second of audio
wav_decoder = audio_ops.decode_wav(wav_loader,
                                   desired_channels=1,
                                   desired_samples=desired_samples)

# Compute the spectrogram
spectrogram = audio_ops.audio_spectrogram(wav_decoder.audio,
                                          window_size=320,
                                          stride=160,
                                          magnitude_squared=False)

with tf.Session() as sess:
    feed_dict = {wav_filename_placeholder_: "/<folder_path>/audio_sample.wav"}
    # Get the input audio and the spectrogram
    audio_ops_wav_decoder_audio, audio_ops_spectrogram = sess.run(
        [wav_decoder.audio, spectrogram], feed_dict)
2) The tf.contrib.signal method:
# Input WAV audio (fed with the same signal decoded above: wav_decoder.audio)
signals = tf.placeholder(tf.float32, [None, None])

# Compute the STFTs and take the absolute values
stfts = tf.contrib.signal.stft(signals,
                               frame_length=320,
                               frame_step=160,
                               fft_length=512,
                               window_fn=None)
magnitude_spectrograms = tf.abs(stfts)

with tf.Session() as sess:
    feed_dict = {signals: audio_ops_wav_decoder_audio.reshape(1, 16000)}
    tf_original, tf_stfts, tf_spectrogram = sess.run(
        [signals, stfts, magnitude_spectrograms], feed_dict)
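For reference, a small sanity check along these lines shows the mismatch (variable names come from the two snippets above; the shapes assume the 1-second, 16 kHz clip):

import numpy as np

# audio_ops.audio_spectrogram returns [channels, frames, fft_bins];
# tf.contrib.signal.stft returns [batch, frames, fft_length // 2 + 1].
# With 16000 samples, window_size=320 and stride=160, both should yield
# 99 frames and 257 frequency bins (fft_length 512), so the arrays are
# directly comparable.
print(audio_ops_spectrogram.shape)  # e.g. (1, 99, 257)
print(tf_spectrogram.shape)         # e.g. (1, 99, 257)
print(np.max(np.abs(audio_ops_spectrogram - tf_spectrogram)))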
Thank you in advance.
Upvotes: 2
Views: 2642
Reputation: 73
I found these helpful comments on GitHub that discuss the differences:
https://github.com/tensorflow/tensorflow/issues/11339#issuecomment-345741527
https://github.com/tensorflow/tensorflow/issues/11339#issuecomment-443553788
You can think of audio_ops.audio_spectrogram and audio_ops.mfcc as "fused" ops (like the fused batch-norm or fused LSTM cells that TensorFlow has) for the ops in tf.contrib.signal. I think the original motivation for them was that a fused op makes it easier to provide mobile support. Long term, I think it would be nice if we removed them and provided automatic fusing via XLA, or unified the API to match the tf.contrib.signal API and provided fused keyword arguments to tf.contrib.signal functions, like we do for tf.layers.batch_normalization.
audio_spectrogram is a C++ implementation of an STFT, while tf.signal.stft uses TensorFlow ops to compute the STFT (and thus has CPU, GPU and TPU support).
The main cause of difference between them is that audio_spectrogram uses fft2d to compute FFTs while tf.contrib.signal.stft uses Eigen (CPU), cuFFT (GPU), and XLA (TPU). There is another very minor difference, which is that the default periodic Hann window used by each is slightly different. tf.contrib.signal.stft follows numpy/scipy's definition.
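In practice you can get the two pipelines much closer by windowing explicitly. Note that the code in the question passes window_fn=None, which disables windowing entirely, whereas audio_spectrogram applies its Hann window internally. A minimal sketch, using tf.contrib.signal's default periodic Hann window and the same 320/160/512 parameters as above:

# Sketch, not a drop-in fix: apply tf.contrib.signal's default periodic
# Hann window instead of no window at all (window_fn=None). The remaining
# small mismatch comes from the slightly different Hann definitions and
# FFT backends mentioned above.
stfts = tf.contrib.signal.stft(signals,
                               frame_length=320,
                               frame_step=160,
                               fft_length=512,
                               window_fn=tf.contrib.signal.hann_window)
magnitude_spectrograms = tf.abs(stfts)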
Upvotes: 4