user1277317

Reputation: 127

Which loss function to use when there are many outputs?

The equipment produces data similar to images of 1000 x 1 pixels. Each image may contain 1, 2, or more objects.

I'm going to build a neural network to detect the objects. I want to have 1,000 outputs, where each output indicates whether there is an object at that pixel or not. Advise me which loss function to use.

It seems to me that "categorical crossentropy" is not suitable. For example: in the training data I mark objects at pixels 10 and 90, and the neural network predicts objects at pixels 11 and 89. That is a small error, but for the network the loss would be the same as if it had predicted objects at pixels 500 and 900.

What loss function is suitable for such a case? I'm using Keras.

Upvotes: 1

Views: 875

Answers (3)

Snake Verde

Reputation: 634

As stated by Siddharth, you'll use two loss functions, since you have both a regression problem and a classification problem. See https://www.youtube.com/watch?v=GSwYGkTfOKk for more details. In particular, pay attention to this slide:

(slide from the linked lecture, showing the two losses side by side)

That is, the first task simply classifies if the object is present (logistic regression loss) and the second task finds the bounding boxes (square error loss).
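The two-task setup described above can be sketched as a two-head Keras model: one head for presence classification (binary cross-entropy) and one for position regression (squared error). This is only a minimal illustration; the layer sizes, head names, and normalized-position convention are my own assumptions, not from the answer.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Shared trunk over the 1000-pixel input signal.
inputs = keras.Input(shape=(1000,))
x = layers.Dense(256, activation="relu")(inputs)
x = layers.Dense(128, activation="relu")(x)

# Head 1: is an object present? (logistic-regression-style loss)
presence = layers.Dense(1, activation="sigmoid", name="presence")(x)
# Head 2: where is it? Position normalized to [0, 1] (squared-error loss).
position = layers.Dense(1, activation="sigmoid", name="position")(x)

model = keras.Model(inputs, [presence, position])
model.compile(
    optimizer="adam",
    loss={"presence": "binary_crossentropy", "position": "mse"},
    loss_weights={"presence": 1.0, "position": 1.0},
)
```

Keras sums the two (weighted) losses into a single training objective, which is exactly the "classification loss + localization loss" combination the slide describes.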

Upvotes: 1

Siddharth Das

Reputation: 1115

In object detection we mainly have two tasks: localization and classification. Therefore we have two losses, one for each task: a localization loss and a classification loss. Localization quality is commonly measured with IoU (intersection over union).
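For a 1-D signal like the asker's, IoU reduces to overlap between intervals. A minimal sketch (the function name and interval convention are illustrative, not from the answer):

```python
def iou_1d(a, b):
    """IoU of two 1-D intervals given as (start, end) pixel coordinates."""
    start_a, end_a = a
    start_b, end_b = b
    # Overlap length, clipped at zero when the intervals are disjoint.
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0

print(iou_1d((10, 20), (15, 25)))  # 5 / 15 ≈ 0.333
```

An IoU of 1.0 means a perfect localization; 0.0 means the predicted and ground-truth intervals do not overlap at all.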

Upvotes: 1

Zaw Lin

Reputation: 5708

You can use binary cross-entropy loss and set the nearest n bins to the ground truth as positive labels.

For example, suppose you have 10 pixels, the ground-truth label is at pixel 3, and you select 3 neighbouring bins.

In typical categorical cross-entropy, you would set the label as follows, using a one-hot encoded vector:

[0 0 1 0 0 0 0 0 0 0]

In the solution I suggest, you would use this:

[0 1 1 1 0 0 0 0 0 0]

Or it can be this, which imposes a Gaussian instead of flat labels:

[0 0.5 1 0.5 0 0 0 0 0 0]
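Such soft label vectors are easy to build with NumPy. A small sketch (the function name and `sigma` default are my own choices, not from the answer); the resulting vector would be paired with per-pixel sigmoid outputs and `binary_crossentropy` in Keras:

```python
import numpy as np

def soft_labels(n_pixels, object_positions, sigma=1.0):
    """Gaussian-smoothed 1-D label vector: 1.0 at each object,
    decaying toward zero with distance from it."""
    idx = np.arange(n_pixels, dtype=float)
    labels = np.zeros(n_pixels)
    for pos in object_positions:
        # Take the max so overlapping Gaussians from nearby objects
        # never push a label above 1.0.
        labels = np.maximum(labels, np.exp(-0.5 * ((idx - pos) / sigma) ** 2))
    return labels

print(np.round(soft_labels(10, [2], sigma=1.0), 2))
```

With this labeling, a prediction one pixel off the ground truth is penalized much less than one hundreds of pixels away, which addresses the asker's objection to plain one-hot targets.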


The object detection architectures suggested in the comments also essentially behave the same way I described, except that they use a quantized scheme:

[0 1 0 0 0 0 0 0 0 0] (actual pixels)

[- - 1 - - - - 0 - -] (grouped into 2 groups of 5; your network now has only two outputs. Think of this as a binning stage: the actual pixel belongs to group 1. This subnetwork uses binary cross-entropy.)

[1 0] (first classification network output)

[-1 0] (this second stage can be thought of as a delta network: it takes the classified bin from the first stage and outputs a correction value. Since the first bin is anchored at index 2, you need to predict -1 to move it to index 1. This network is trained with a smoothed L1 loss.)

Now there is an immediate problem: what if there are two objects in group 1? This problem unfortunately also exists in object detection architectures. The workaround is to define slightly shifted and scaled bin (or anchor) positions. This way you can detect at most N objects at one pixel, where N is the number of anchors defined at that pixel.
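The bin-plus-delta targets from the example above can be sketched like this (the function name and the center-of-bin anchor convention are illustrative assumptions; real detectors define anchors more elaborately):

```python
def make_targets(n_pixels, bin_size, object_pos):
    """Targets for the two-stage scheme: a one-hot bin classification
    vector and a per-bin delta from the bin's anchor to the true pixel."""
    n_bins = n_pixels // bin_size
    # Each bin is anchored at its center pixel, e.g. bins of 5 -> 2, 7.
    anchors = [b * bin_size + bin_size // 2 for b in range(n_bins)]
    cls = [0] * n_bins      # which bin contains the object
    deltas = [0] * n_bins   # correction from anchor to true position
    b = object_pos // bin_size
    cls[b] = 1
    deltas[b] = object_pos - anchors[b]
    return cls, deltas

print(make_targets(10, 5, 1))  # ([1, 0], [-1, 0])
```

This reproduces the worked example: the object at pixel 1 falls in the first bin (classification target [1 0]), and since that bin is anchored at index 2, the delta network must predict -1.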

Upvotes: 0
