Reputation: 1711
I'm following this tutorial for TensorFlow:
It describes the implementation of the cross-entropy function as:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
First, tf.log computes the logarithm of each element of y. Next, we multiply each element of y_ with the corresponding element of tf.log(y). Then tf.reduce_sum adds the elements in the second dimension of y, due to the reduction_indices=1 parameter. Finally, tf.reduce_mean computes the mean over all the examples in the batch.
From reading the tutorial, my understanding is that both the actual values (y_) and the predicted values (y) are 2D tensors: the rows correspond to the number of MNIST vectors used, and the columns to the size of each vector, 784.
The quote above says that "we multiply each element of y_ with the corresponding element of tf.log(y)".
My question is: are we doing traditional matrix multiplication here (i.e., row by column)? The sentence suggests that we are not.
Upvotes: 0
Views: 326
Reputation: 2982
Traditional matrix multiplication is only used when calculating the model hypothesis, as seen in the code that multiplies x by W:
y = tf.nn.softmax(tf.matmul(x, W) + b)
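As a rough sketch of the shapes involved (the setup below follows the MNIST beginners tutorial and is only illustrative, using the TF 1.x API):
import tensorflow as tf

# x holds flattened 28x28 MNIST images: shape [batch, 784]
x = tf.placeholder(tf.float32, [None, 784])
# W has one weight column per digit class: shape [784, 10]
W = tf.Variable(tf.zeros([784, 10]))
# b has one bias per class: shape [10]
b = tf.Variable(tf.zeros([10]))

# tf.matmul is a true matrix multiplication: [batch, 784] x [784, 10] -> [batch, 10]
y = tf.nn.softmax(tf.matmul(x, W) + b)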
The code y_ * tf.log(y) in the code block:
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y),
                                              reduction_indices=[1]))
performs an element-wise multiplication of the original targets (y_) with the log of the predicted targets (y).
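To make the difference concrete, here is a minimal NumPy sketch (the numbers are made up, and the shapes are shrunk to 2 examples and 3 classes rather than the tutorial's 10):
import numpy as np

# One-hot targets y_ and predicted probabilities y, both of shape [2, 3]
y_ = np.array([[0., 1., 0.],
               [1., 0., 0.]])
y = np.array([[0.2, 0.7, 0.1],
              [0.5, 0.3, 0.2]])

elementwise = y_ * np.log(y)           # element-wise product, still shape [2, 3]
per_example = -np.sum(elementwise, 1)  # sum over the class dimension, shape [2]
loss = np.mean(per_example)            # scalar mean over the batch

# Traditional matrix multiplication would need the inner dimensions to match;
# np.matmul(y_, np.log(y)) fails here because [2, 3] x [2, 3] is undefined.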
The goal of calculating the cross-entropy loss is to measure how well the predicted class probabilities match the true class of each observation in the classification problem.
It is this measure (the cross-entropy loss) that is minimized by the optimization routine, of which gradient descent is a popular example, in order to find the set of parameters for W (and b) that improves the classifier's performance. We say the loss is minimized because the lower the loss, the better the model.
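In the same TF 1.x style as the tutorial, the minimization step is a single line (the 0.5 learning rate is just an illustrative value):
# Gradient descent repeatedly updates W and b in the direction that lowers cross_entropy
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)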
Upvotes: 1