marcels93

Reputation: 55

Using softmax as output function while using binary_crossentropy as loss function?

I am currently training a model for binary classification. I liked the idea of having two probabilities (one for each class) that add up to 1, so I used softmax in my output layer and got very high accuracies (up to 99.5%) with very low losses of 0.007. While researching a bit, I read that binary crossentropy is the only real choice when training for a two-class classification problem.

Now I am unsure whether I have to use categorical_crossentropy as the loss function when I want to use softmax. Could you help me understand which loss function and activation function should be used in a binary classification problem, and why?

Here's my code:

import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(10, input_dim=input_dim, activation='sigmoid'))
model.add(tf.keras.layers.Dense(10, activation='sigmoid'))
model.add(tf.keras.layers.Dense(2, activation='softmax'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
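
For reference, the loss in the compile call above can be worked out by hand. This is only a minimal sketch in plain Python, not the Keras implementation. It assumes one-hot labels (the label format is not stated in the question) and assumes binary crossentropy is applied to each of the two outputs separately and then averaged. The logits and the helper names are made up for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def elementwise_bce_mean(y_true, y_pred):
    # Binary crossentropy per output unit, averaged over the units
    # (assumed behaviour for a multi-unit output).
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for t, p in zip(y_true, y_pred)]
    return sum(losses) / len(losses)

def categorical_crossentropy(y_true, y_pred):
    # Standard categorical crossentropy for a one-hot target.
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred))

y_true = [0.0, 1.0]            # hypothetical one-hot label: class 1
y_pred = softmax([0.2, 2.5])   # hypothetical logits of the 2-unit output

bce = elementwise_bce_mean(y_true, y_pred)
cce = categorical_crossentropy(y_true, y_pred)
print(bce, cce)
```

Under these assumptions the two values agree up to floating-point error, because the two softmax outputs sum to 1 and both per-unit terms reduce to the negative log of the probability of the true class. That only holds for exactly two classes with one-hot targets.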

Upvotes: 0

Views: 3325

Answers (2)

l4morak

Reputation: 311

If every sample belongs to exactly one class, then there is no difference between

model.add(Dense(1, activation='sigmoid'))
loss = tf.keras.losses.BinaryCrossentropy()

and

model.add(Dense(2, activation='softmax'))
loss = tf.keras.losses.CategoricalCrossentropy()

As mentioned here, binary crossentropy is just a special case of categorical crossentropy.
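
The equivalence can be checked numerically. The following is a minimal sketch in plain Python rather than Keras code. It relies on the identity that a softmax over two logits z0 and z1 gives class 1 the probability sigmoid(z1 - z0). The logits, the label, and the helper names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def binary_crossentropy(y_true, p):
    # y_true is 0 or 1, p is the predicted probability of class 1.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def categorical_crossentropy(one_hot, probs):
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

z0, z1 = 0.3, 1.7   # hypothetical logits of a 2-unit output layer
label = 1           # hypothetical true class

# Two-unit softmax output with a one-hot target.
probs = softmax([z0, z1])
cce = categorical_crossentropy([0.0, 1.0], probs)

# Single sigmoid unit whose logit is the difference z1 - z0.
p1 = sigmoid(z1 - z0)
bce = binary_crossentropy(label, p1)

print(probs[1], p1)
print(cce, bce)
```

Both prints show matching pairs up to floating-point error. The two-unit softmax network has one redundant degree of freedom, since only the difference of the two logits matters.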

Upvotes: 5

Ahx

Reputation: 7985

  • The loss function depends on the problem type.

    • For a binary classification problem -> binary_crossentropy

    • For a multi-class problem -> categorical_crossentropy

    • For a text classification problem -> MSE loss is calculated.

  • The activation function also depends on the problem type.

    • Generally, the relu activation function is used, but for a binary classification problem tanh sometimes performs better.

I wouldn't suggest using sigmoid.

As for the optimizer, Adadelta generally performs better.

The reason for these suggestions is the accuracy metric. The aim is to reach high accuracy, so your model must actually be learning. There are no strict rules, but some methods have been shown to perform better.

Upvotes: 4
