matt-pielat

Reputation: 1799

tf.contrib.signal.stft returns an empty matrix

This is the piece of code I run:

import tensorflow as tf

sess = tf.InteractiveSession()

filename = 'song.mp3' # 30 second mp3 file
SAMPLES_PER_SEC = 44100

audio_binary = tf.read_file(filename)

pcm = tf.contrib.ffmpeg.decode_audio(
    audio_binary, file_format='mp3',
    samples_per_second=SAMPLES_PER_SEC, channel_count=1)
stft = tf.contrib.signal.stft(
    pcm, frame_length=1024, frame_step=512, fft_length=1024)

sess.close()

The mp3 file seems to be decoded properly, because print(pcm.eval().shape) prints:

(1323119, 1)

And there are even some actual non-zero values when I print them with print(pcm.eval()[1000:1010]):

[[ 0.18793298]
 [ 0.16214484]
 [ 0.16022217]
 [ 0.15918455]
 [ 0.16428113]
 [ 0.19858395]
 [ 0.22861415]
 [ 0.2347789 ]
 [ 0.22684409]
 [ 0.20728172]]

But for some reason print(stft.eval().shape) evaluates to:

(1323119, 0, 513) # why the zero dimension?

And therefore print(stft.eval()) is:

[]

According to the documentation, the second dimension of the tf.contrib.signal.stft output is the number of frames. Why are there no frames, though?

Upvotes: 0

Views: 513

Answers (1)

matt-pielat

Reputation: 1799

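tf.contrib.ffmpeg.decode_audio returns a tensor of shape (?, 1), i.e. (samples, channels): one channel of ? samples.

However, tf.contrib.signal.stft treats the last dimension as the sample axis, so it expects a (signal_count, samples) tensor as input. Given the (1323119, 1) tensor, it sees 1323119 separate signals of one sample each. None of them is long enough to fill a single 1024-sample frame, hence the zero frames. The tensor therefore has to be transposed beforehand.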

Modifying the call like this does the trick:

stft = tf.contrib.signal.stft(tf.transpose(pcm), frame_length=1024, frame_step=512, fft_length=1024)
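
To make the shape arithmetic concrete, here is a minimal sketch in plain Python (no TensorFlow needed). The helper stft_output_shape is hypothetical, not part of any library. It assumes the frames are counted without end padding (pad_end=False, the default), as 1 + (samples - frame_length) // frame_step, and that a real-input FFT yields fft_length // 2 + 1 bins.

```python
def stft_output_shape(input_shape, frame_length, frame_step, fft_length):
    """Hypothetical helper: predict the STFT output shape for a given input shape.

    Assumes the last dimension is the sample axis and that no end padding
    is applied, so a signal shorter than frame_length yields zero frames.
    """
    *batch_dims, samples = input_shape
    if samples < frame_length:
        num_frames = 0
    else:
        num_frames = 1 + (samples - frame_length) // frame_step
    # A real-valued FFT keeps only the non-redundant half of the spectrum.
    num_bins = fft_length // 2 + 1
    return (*batch_dims, num_frames, num_bins)


# Shape as returned by decode_audio: (samples, channels).
print(stft_output_shape((1323119, 1), 1024, 512, 1024))  # (1323119, 0, 513)

# Shape after tf.transpose: (channels, samples).
print(stft_output_shape((1, 1323119), 1024, 512, 1024))  # (1, 2583, 513)
```

The first result reproduces the shape from the question. The second is what the transposed call should produce under the assumptions above.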

Upvotes: 2
