Reputation: 127
Data similar to images of 1000 x 1 pixels comes from the equipment. Somewhere in an image there may be 1, 2, or more objects.
I'm going to build a neural network to detect the objects. I want it to have 1,000 outputs, each indicating whether there is an object at that pixel or not. Please advise me which loss function to use.
It seems to me that "categorical crossentropy" is not suitable. For example: in the training data I indicate that the objects are at pixels 10 and 90, and the neural network predicts that they are at pixels 11 and 89. That should not be a big loss, but for the network it is the same loss as if it had predicted the objects at pixels 500 and 900.
What loss function is suitable for such a case? I'm using Keras.
Upvotes: 1
Views: 875
Reputation: 634
As stated by Siddharth, you'll use two loss functions, since you have both a regression problem and a classification problem. See https://www.youtube.com/watch?v=GSwYGkTfOKk for more details, in particular the slide on the combined loss: the first task simply classifies whether the object is present (logistic regression loss), and the second task finds the bounding boxes (squared error loss).
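A minimal Keras sketch of that two-head idea (the layer sizes, names and loss weights below are illustrative assumptions, not taken from the talk): one sigmoid head says whether an object is present at each position, a second head regresses the position, and each head gets its own loss.

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(1000, 1))  # the 1000 x 1 "image"
x = layers.Conv1D(32, 9, padding="same", activation="relu")(inputs)
x = layers.Conv1D(32, 9, padding="same", activation="relu")(x)

# Head 1: per-position "is an object present" score (classification task)
presence = layers.Conv1D(1, 1, activation="sigmoid", name="presence")(x)

# Head 2: per-position offset / box regression (regression task)
offset = layers.Conv1D(1, 1, activation="linear", name="offset")(x)

model = Model(inputs, [presence, offset])
model.compile(
    optimizer="adam",
    loss={"presence": "binary_crossentropy", "offset": "mse"},
    loss_weights={"presence": 1.0, "offset": 1.0},  # tune the relative weighting
)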
Upvotes: 1
Reputation: 1115
In object detection we mainly have two tasks: localization and classification. Therefore we have two losses, one for localization and one for classification. The localization loss is calculated using IoU (Intersection over Union). More details here.
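For the 1-D data in the question, a "box" is just an interval along the 1000-pixel axis, so IoU reduces to interval overlap. A small sketch (the intervals below are made up for illustration):

def iou_1d(a, b):
    # Intersection over Union of two intervals a = (a0, a1), b = (b0, b1)
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

print(iou_1d((10, 20), (12, 22)))    # 0.667 -> heavily overlapping boxes
print(iou_1d((10, 20), (500, 510)))  # 0.0   -> no overlap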
Upvotes: 1
Reputation: 5708
You can use binary cross-entropy loss and set the nearest n bins to the ground truth as labels.
For example, suppose you have 10 pixels, the ground truth is at pixel 3, and you select 3 neighbours.
With typical categorical cross-entropy you would set the label as follows, using a one-hot encoded vector:
[0 0 1 0 0 0 0 0 0 0]
In the solution I suggested, you would use this:
[0 1 1 1 0 0 0 0 0 0]
Or it can be this, basically imposing a Gaussian instead of flat labels:
[0 0.5 1 0.5 0 0 0 0 0 0]
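A short sketch of how those targets could be built and trained against in Keras (NumPy/TensorFlow assumed; the Gaussian width is an illustrative choice):

import numpy as np
import tensorflow as tf

n_pixels = 10
gt = 2  # ground-truth pixel, 0-based (the "3rd pixel" in the example)

# Flat neighbourhood labels: [0 1 1 1 0 0 0 0 0 0]
flat = np.zeros(n_pixels, dtype="float32")
flat[max(0, gt - 1):gt + 2] = 1.0

# Gaussian-shaped labels, roughly [0 0.5 1 0.5 0 ...]
idx = np.arange(n_pixels, dtype="float32")
gauss = np.exp(-0.5 * ((idx - gt) / 0.85) ** 2)

# Either target is trained against a per-pixel sigmoid output with
# binary cross-entropy applied element-wise (not softmax over all pixels)
pred = tf.random.uniform((1, n_pixels))
loss = tf.keras.losses.binary_crossentropy(flat[None, :], pred)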
Object detection architectures, as suggested in the comments, also essentially behave the way I described, except that they use a quantized scheme:
[0 1 0 0 0 0 0 0 0 0] (actual pixels)
[- - 1 - - - - 0 - -] (grouped into 2 groups of 5; your network only has two outputs now. Think of this as the binning stage: the actual pixel belongs to group 1, and this subnetwork uses binary cross-entropy.)
[1 0] (first classification network output)
[-1 0] (this second stage can be thought of as a delta network: it takes the classified bin from the first stage and outputs a correction value. Since the first bin is anchored at index 2, you need to predict -1 to move it to index 1. This network is trained using a smoothed L1 loss.)
Now there is immediately a problem: what if there are two objects in group 1? This is an unfortunate limitation that also exists in object detection architectures. The workaround is to define slightly shifted and scaled bin (or anchor) positions. This way you can detect at most N objects at one pixel, where N is the number of anchors defined at that pixel.
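A rough numeric sketch of the bin + delta scheme above, using plain TensorFlow (the predicted values are made up, and the crude masking of the offset loss to the positive bin stands in for what real architectures do per anchor):

import tensorflow as tf

# 10 pixels grouped into 2 bins of 5; anchors at the bin centres (indices 2 and 7)
anchors = tf.constant([2.0, 7.0])

# Ground truth: object at pixel 1 -> bin 0 is positive, delta = 1 - 2 = -1
bin_true = tf.constant([[1.0, 0.0]])
delta_true = tf.constant([[-1.0, 0.0]])

# Hypothetical network outputs
bin_pred = tf.constant([[0.8, 0.1]])     # sigmoid objectness per bin
delta_pred = tf.constant([[-0.7, 0.3]])  # predicted offset per bin

cls_loss = tf.keras.losses.binary_crossentropy(bin_true, bin_pred)
# Smoothed L1 (Huber) loss on offsets, masked to the positive bin only
reg_loss = tf.keras.losses.Huber()(delta_true * bin_true, delta_pred * bin_true)

total = tf.reduce_mean(cls_loss) + reg_loss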
Upvotes: 0