Robin Richtsfeld

Reputation: 1036

What gives the logits this unexpected shape?

I am currently developing an audio classifier with the Python API of TensorFlow, using the UrbanSound8K dataset. I collect exactly 176400 data points from each file and try to distinguish between 10 mutually exclusive classes.

I have adapted this example code for a convolutional neural net: https://www.tensorflow.org/get_started/mnist/pros

Unfortunately, I am getting the following errors:

Traceback (most recent call last):
  ...
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10]
     [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "urban-cnn.py", line 124, in <module>
    sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: .5})
  ...
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10]
     [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]]

Caused by op 'xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits', defined at:
  File "urban-cnn.py", line 102, in <module>
    xent = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_conv), name="xent")
  ...

InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10]
     [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]]

Here is a slightly edited version of the code:

import tensorflow as tf
import soundfile as sfx
import numpy as np
import math
import glob

batch_size = 10
n_epochs = 10

input_width = 176400

n_labels = 10

widths = [5, 5, 7]
channels = [1, 8, 64, 512, n_labels]

learning_rate = 1e-4

def load_data():
    data_x = []
    data_y = []

    for path in glob.glob("./UrbanSound8K/audio/fold1/*.wav"):
        name = path.split("/")[-1].split(".")[0]
        x, sample_rate = sfx.read(path, frames=input_width, fill_value=0.)
        y = int(name.split("-")[1])

        if x.ndim > 1:
            x = x.take(0, axis=1)

        data_x.append(x)
        data_y.append(y)

    return data_x, data_y

data_x, data_y = load_data()
data_split = int(len(data_x) * .9)

train_x = data_x[:data_split]
train_y = data_y[:data_split]

test_x = data_x[data_split:]
test_y = data_y[data_split:]

x = tf.placeholder(tf.float32, [None, input_width], name="x")
y = tf.placeholder(tf.int64, [None], name="y")

x_reshaped = tf.reshape(x, [-1, 1, input_width, channels[0]], name="x_reshaped")

def weights_x(shape, name):
    w = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name=name)
    tf.summary.histogram("weights", w)
    return w

def weights(layer, name):
    return weights_x([1, widths[layer], channels[layer], channels[layer+1]], name)

def biases(layer, name):
    b = tf.Variable(tf.constant(0.1, shape=[channels[layer+1]]), name=name)
    tf.summary.histogram("biases", b)
    return b

def convolution(p, w, b, name):
    c = tf.nn.relu(tf.nn.conv2d(p, w, strides=[1, 1, 1, 1], padding="SAME") + b, name=name)
    tf.summary.histogram("convolution", c)
    return c

def pooling(c, name):
    p = tf.nn.max_pool(c, ksize=[1, 1, 6, 1], strides=[1, 1, 6, 1], padding="SAME", name=name)
    tf.summary.histogram("pooling", p)
    return p

with tf.name_scope("conv1"):
    w1 = weights(0, "w1")
    b1 = biases(0, "b1")
    c1 = convolution(x_reshaped, w1, b1, "c1")
    p1 = pooling(c1, "p1")

with tf.name_scope("conv2"):
    w2 = weights(1, "w2")
    b2 = biases(1, "b2")
    c2 = convolution(p1, w2, b2, "c2")
    p2 = pooling(c2, "p2")

with tf.name_scope("dens"):
    n_edges = widths[2] * channels[2]
    wf1 = weights_x([n_edges, channels[3]], "wf1")
    bf1 = biases(2, "bf1")
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1")
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1")

with tf.name_scope("drop"):
    keep_prob = tf.placeholder(tf.float32, name="keep_prob")
    dropout = tf.nn.dropout(f1, keep_prob)

with tf.name_scope("read"):
    wf2 = weights_x([channels[3], channels[4]], "wf2")
    bf2 = biases(3, "bf2")
    y_conv = tf.matmul(dropout, wf2) + bf2

with tf.name_scope("xent"):
    xent = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_conv), name="xent")
    tf.summary.scalar("xent", xent)

with tf.name_scope("optimizer"):
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(xent)

with tf.name_scope("accuracy"):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), y, name="correct_prediction")
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="accuracy")
    tf.summary.scalar("accuracy", accuracy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print("Initialized Global Variables")

    for epoch in range(n_epochs):
        n_itr = len(train_x)//batch_size

        for itr in range(n_itr):
            left, right = itr*batch_size, (itr+1)*batch_size
            batch_x, batch_y = train_x[left:right], train_y[left:right]

            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: .5})
        print("epoch: ", epoch + 1)

    print("accuracy: ", sess.run(accuracy, feed_dict={x: test_x, y: test_y, keep_prob: 1.}))

When I inspect the Tensor shapes before calling sess.run(...), everything looks as expected.

So why do the logits have the shape [7000, n_labels] instead of [batch_size, n_labels]?

Upvotes: 0

Views: 587

Answers (1)

lejlot

Reputation: 66775

Your network has an incorrect structure; the crucial problem is here:

with tf.name_scope("dens"):
    n_edges = widths[2] * channels[2]
    wf1 = weights_x([n_edges, channels[3]], "wf1")
    bf1 = biases(2, "bf1")
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1")
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1")

p2 has the shape [10, 1, 4900, 64], so the flattened size should be 4900 * 64 = 313600, but your n_edges is only 448 (a far too small layer!). If you set n_edges = 313600 everything works, although it is up to you whether this is the architecture you had in mind. It looks like you merged two incompatible things: you used the shape of the convolution kernel to compute how big the flattened layer should be. That is not how convolution works - the shape of a layer's output depends on the size of the input, the kernel and the padding, and is in general much bigger. In this example the fully connected layer should actually have over 300k input neurons, not just 448 as in your code. The crucial distinction is that this fully connected layer operates on the output of the convolution, not on its parameters.
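To see where the 4900 comes from (following the strides and padding in the code above): the convolutions use padding="SAME" with stride 1, so they keep the width at 176400, and each 1x6 max-pool with stride 6 divides it by 6, giving 176400 / 6 / 6 = 4900 after p2, times 64 channels = 313600 values per example.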

The 7000 is simply the result of batch_size * (4900 * 64) / n_edges = 10 * 313600 / 448 = 7000 (the reshape that produces pf1).
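As a quick sanity check (a minimal sketch, relying on the variable names already defined in the question's code), printing the static shapes while building the graph shows where the 7000 comes from:

print(p2.get_shape())      # (?, 1, 4900, 64) -> 313600 values per example
print(pf1.get_shape())     # (?, 448)         -> 313600 / 448 = 700 rows per example
print(y_conv.get_shape())  # (?, 10)          -> with batch_size = 10 this becomes [7000, 10] at run time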

A more generic fix is to do

p2s = p2.get_shape()
n_edges = int(p2s[1] * p2s[2] * p2s[3])

since at this point all the dimensions of p2 (apart from the 0th, the batch dimension) are already known, so they can be read and used to construct the remainder of the network.
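Applied to the code from the question, the dens block would then look roughly like this (a sketch that keeps the original helper functions and names):

with tf.name_scope("dens"):
    # Flatten the actual output of the second pooling layer,
    # not the kernel dimensions.
    p2s = p2.get_shape()
    n_edges = int(p2s[1] * p2s[2] * p2s[3])  # 1 * 4900 * 64 = 313600
    wf1 = weights_x([n_edges, channels[3]], "wf1")
    bf1 = biases(2, "bf1")
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1")
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1")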

Upvotes: 1
