raghav2956

Reputation: 71

Understanding when to and when not to use Softmax as output layer activation

So I just started working with neural nets and set out to make a basic image classification network with binary labels. From my understanding of neural nets, I thought that the purpose of having the Softmax activation function in the output layer was to convert the incoming information into probabilities of the labels, with the predicted label being the one with the higher probability. So my first question is: why does my model train well when I drop the softmax activation from the output layer, but perform badly when I keep it?
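
To be clear about what I mean by "converting to probabilities", here is a tiny NumPy sketch of what I understand softmax to do (just an illustration, not part of the network below):

import numpy as np

def softmax(z):
    # subtract the max for numerical stability, then normalise the exponentials
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 0.5])   # raw scores for the two labels
print(softmax(logits))          # [0.81757448 0.18242552] -> predicted label is index 0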

I am pretty sure this is some obvious issue that escapes me regarding the network architecture and the various hyperparameters that I use. Would be grateful for your help! I am pasting my code below for you to take a look; I haven't included the output, but let me know if you need that too.

#Imports (assumed from the usage below)
import os
import numpy as np
from PIL import Image
from sklearn.utils import shuffle
from sklearn.preprocessing import OneHotEncoder
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import optimizers

#Train Data
INPUT_FOLDER = '../input/chest-xray-pneumonia/chest_xray/train/NORMAL'
images = os.listdir(INPUT_FOLDER)
X_train_1 = []
for instance in images:
    image = Image.open('../input/chest-xray-pneumonia/chest_xray/train/NORMAL/' + instance)
    image_rz = image.resize((100,100)).convert('L')
    array = np.array(image_rz)
    X_train_1.append(array)
X_train_1 = np.array(X_train_1)
print(X_train_1.shape)

INPUT_FOLDER = '../input/chest-xray-pneumonia/chest_xray/train/PNEUMONIA'
images = os.listdir(INPUT_FOLDER)
X_train_2 = []
for instance in images:
    image = Image.open('../input/chest-xray-pneumonia/chest_xray/train/PNEUMONIA/' + instance)
    image_rz = image.resize((100,100)).convert('L')
    array = np.array(image_rz)
    X_train_2.append(array)
X_train_2 = np.array(X_train_2)
print(X_train_2.shape)
X_trn = np.concatenate((X_train_1, X_train_2))
print(X_trn.shape)

#Make Labels
y_trn = np.zeros(5216, dtype = '<U9') #dtype wide enough for 'PNEUMONIA'; plain str gives '<U1' and truncates labels to one character
y_trn[:1341] = 'NORMAL'
y_trn[1341:] = 'PNEUMONIA'
y_trn = y_trn.reshape(5216,1)

#Shuffle data and labels together
X_trn, y_trn = shuffle(X_trn, y_trn)

#Onehot encode categorical labels
onehot_encoder = OneHotEncoder(sparse=False)
y_trn = onehot_encoder.fit_transform(y_trn)

#Model
model = keras.Sequential([
    keras.layers.Flatten(input_shape = (100,100)),
    keras.layers.Dense(256, activation = 'selu'),
    keras.layers.Dense(2, activation = 'softmax')
])

adm = optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, amsgrad=False)

model.compile(optimizer = adm,
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

for layer in model.layers:
    print(layer, layer.trainable)

#X_val and y_val are prepared the same way from the validation folder (that code isn't shown here)
model.fit(X_trn, y_trn, validation_data = (X_val, y_val), epochs=30, shuffle = True)



Upvotes: 0

Views: 2890

Answers (1)

Bashir Kazimi

Reputation: 1377

The secret lies in your loss function. When you set from_logits=True in your loss function:

loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True) 

it expects the values to come from a layer without a softmax activation, so it performs the softmax operation itself. If you already have a softmax activation in your final layer, you should not set from_logits to True; set it to False (or leave it out, since False is the default).

Your model works well without the softmax function and badly with it for this reason.
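
As a minimal sketch of the two consistent setups (reusing the layer sizes from the question):

import tensorflow as tf
from tensorflow import keras

# Option 1: keep the softmax output and tell the loss it receives probabilities
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(100, 100)),
    keras.layers.Dense(256, activation='selu'),
    keras.layers.Dense(2, activation='softmax')   # outputs probabilities
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

# Option 2: drop the softmax and let the loss apply it internally
model = keras.Sequential([
    keras.layers.Flatten(input_shape=(100, 100)),
    keras.layers.Dense(256, activation='selu'),
    keras.layers.Dense(2)                         # raw logits, no activation
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

With option 2, apply tf.nn.softmax to the model's predictions if you need class probabilities at inference time.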

Upvotes: 4
