Reputation: 572
After working through some Keras examples and tutorials, I am confused about which cross-entropy function I should use in my project. In my case I want to predict multiple labels (positive, negative, and neutral) for online comments with an LSTM model. The labels have been converted to one-hot vectors with the to_categorical method, as documented in Keras:
(...) when using the categorical_crossentropy loss, your targets should be in categorical format (e.g. if you have 10 classes, the target for each sample should be a 10-dimensional vector that is all-zeros except for a 1 at the index corresponding to the class of the sample).
The one-hot array looks as follows:
array([[1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
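For reference, a minimal sketch of how such targets are produced with to_categorical; the integer label encoding (0 = positive, 1 = negative, 2 = neutral) is an assumption for illustration:

import numpy as np
from keras.utils import to_categorical

labels = np.array([0, 0, 2])                  # hypothetical integer class ids
one_hot = to_categorical(labels, num_classes=3)
print(one_hot)
# [[1. 0. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]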
Because there are multiple labels, I would prefer to use categorical_crossentropy. I implemented a model with this loss, but its accuracy was only around 20%. Using binary_crossentropy with a sigmoid activation, my accuracy reached 80%. I am really confused, because some people argue the following:
the accuracy computed with the Keras method "evaluate" is just plain wrong when using binary_crossentropy with more than 2 labels
whereas others have already implemented high-performing models with binary cross-entropy and multiple labels, using essentially the same workflow:
We want the probability of each class, so we use sigmoid on the final layer, which gives outputs in the range 0 to 1. If our aim were to pick a single class, we would have used softmax.
So I just want to know whether there are any problems with choosing binary_crossentropy, as in the linked Kaggle kernel, to predict the outcome class.
Upvotes: 3
Views: 3488
Reputation: 2378
You are confusing multilabel and multiclass classification.
In multiclass classification, your classifier chooses one class out of N classes. Usually the last layer of a multiclass network is a softmax layer, so each output row sums to 1 (it forms a probability distribution over the N classes).
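As a minimal sketch (the layer sizes, sequence length, and feature dimension below are placeholders, not taken from your question), a multiclass LSTM setup would look like:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(64, input_shape=(100, 300)))   # placeholder: 100 timesteps, 300 features
model.add(Dense(3, activation='softmax'))     # one probability per class; each row sums to 1
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])           # targets are one-hot vectors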
Multilabel classification, on the other hand, consists of making a binary choice for each of N independent questions. It makes sense to use binary cross-entropy there, because the way most neural network frameworks apply it, it behaves like averaging binary cross-entropy over those N binary tasks. Multilabel classifiers use sigmoid as the activation of the last layer (the Kaggle kernel you linked uses sigmoid in the last layer).
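For comparison, a multilabel sketch under the same placeholder assumptions, here with N = 5 independent binary questions:

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(64, input_shape=(100, 300)))   # placeholder: 100 timesteps, 300 features
model.add(Dense(5, activation='sigmoid'))     # N independent probabilities; rows need not sum to 1
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])           # targets are N-dimensional 0/1 vectors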
Upvotes: 2