LucG

Reputation: 1334

Is PyTorch's CrossEntropyLoss different from Keras' "categorical_crossentropy"?

I am trying to mimic a PyTorch neural network in Keras.

I am confident that my Keras version of the network is very close to the PyTorch one, but during training I see that the loss values of the PyTorch network are much lower than those of the Keras network. I wonder whether this is because I have not properly copied the PyTorch network in Keras, or whether the loss computation differs between the two frameworks.

PyTorch loss definition:

loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=0.9, weight_decay=5e-4)

Keras loss definition:

sgd = optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)
resnet.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['categorical_accuracy'])

Note that all the layers in the Keras network use L2 regularization (kernel_regularizer=regularizers.l2(5e-4)). I also used he_uniform initialization, which I believe is the default in PyTorch, according to the source code.

The batch size for the two networks is the same: 128.

In the PyTorch version, the loss starts around 4.1209 and decreases to around 0.5. In Keras it starts around 30 and decreases to 2.5.

Upvotes: 5

Views: 5416

Answers (2)

xashru

Reputation: 3580

Keras categorical_crossentropy by default uses from_logits=False, which means it assumes y_pred contains probabilities rather than raw scores (source). You can choose to use a softmax/sigmoid layer; just make sure to set the from_logits argument accordingly.

PyTorch CrossEntropyLoss accepts unnormalized scores for each class, i.e. logits, not probabilities (source). Thus, if you use CrossEntropyLoss, you should not put a softmax/sigmoid layer at the end of your model.
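To make the distinction concrete, here is a small NumPy sketch (the numbers are made up for illustration). It shows why feeding already-softmaxed outputs into a loss that applies softmax internally, as nn.CrossEntropyLoss does, changes the loss value:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw, unnormalized class scores
target = 0                          # index of the true class

# CrossEntropyLoss-style: softmax is applied internally to raw logits,
# then the negative log-probability of the true class is taken
loss_correct = -np.log(softmax(logits)[target])

# Common mistake: the model already ends in a softmax layer, and the
# loss applies softmax a second time, flattening the distribution
loss_double = -np.log(softmax(softmax(logits))[target])

print(loss_correct, loss_double)  # the double-softmax loss is larger
```

The double-softmax model can still train (the ordering of the scores survives), but its reported loss no longer matches a model whose loss sees raw logits, which is one way the two frameworks' numbers can drift apart.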

If this confuses you, please read this discuss.pytorch post.

Upvotes: 13

LucG

Reputation: 1334

In my case, the displayed losses of the two models differed because Keras prints the sum of the cross-entropy loss and the regularization term, whereas my PyTorch script printed only the categorical cross-entropy.
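The bookkeeping difference can be sketched like this (the cross-entropy value and weight vector below are made up for illustration): Keras' displayed loss includes the L2 penalty contributed by each kernel_regularizer, while a PyTorch script that applies weight decay inside the optimizer and prints only nn.CrossEntropyLoss never shows that term:

```python
import numpy as np

rng = np.random.default_rng(0)

ce_loss = 0.5                          # hypothetical cross-entropy value
l2_lambda = 5e-4                       # same coefficient as regularizers.l2(5e-4)
weights = rng.standard_normal(10000)   # stand-in for all kernel weights combined

# L2 penalty added to the Keras loss by the kernel regularizers
l2_penalty = l2_lambda * np.sum(weights ** 2)

keras_displayed = ce_loss + l2_penalty  # Keras logs CE + regularization
pytorch_displayed = ce_loss             # the script printed CE only

print(keras_displayed, pytorch_displayed)
```

With many layers of weights, the accumulated penalty can dominate early in training, which explains a Keras loss starting near 30 while the PyTorch loss starts near 4.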

Upvotes: 3
