Reputation: 41
I'm not able to understand the following piece of text from the YOLO v1 research paper:
We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the “confidence” scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on. To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, lambda(coord) and lambda(noobj) to accomplish this. We set lambda(coord) = 5 and lambda(noobj) = .5
What is the meaning of "overpowering" in the first paragraph, and why would we decrease the loss from the confidence predictions (shouldn't it already be low, especially for boxes that don't contain any object) and increase the loss from the bounding box predictions?
Upvotes: 0
Views: 1235
Reputation: 113
Think of it this way: YOLO works on an NxN grid. For example, with a 13x13 grid and 3 boxes (anchors) per cell, that's 3 * 13^2 = 507 predictions, each with its own object-presence (confidence) score.
If the image only contains, say, 5 objects, then out of those 507 presence scores only 5 are told to predict that an object is there. Because of this imbalance, the model finds it much easier to focus on driving the other 502 presence scores toward zero, since there are simply so many of them.
The lambda(noobj) weight is meant to combat that imbalance between cells with and without an object, as the sketch below shows.
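Here is a minimal sketch of that imbalance, assuming a 13x13 grid with 3 boxes per cell and 5 objects (these shapes are illustrative, not from the YOLO v1 paper, and the tensor names are made up for this example):

    import torch

    S, B = 13, 3                                # 13x13 grid, 3 boxes per cell -> 507 scores
    pred_conf = torch.rand(S * S * B)           # predicted objectness per box
    target_conf = torch.zeros(S * S * B)        # ground truth confidence targets
    obj_mask = torch.zeros(S * S * B, dtype=torch.bool)
    obj_mask[torch.randperm(S * S * B)[:5]] = True  # say only 5 boxes contain an object
    target_conf[obj_mask] = 1.0

    lambda_noobj = 0.5  # value from the paper

    # Unweighted sum-squared error: the 502 empty boxes dominate the total.
    loss_obj = ((pred_conf[obj_mask] - target_conf[obj_mask]) ** 2).sum()
    loss_noobj = ((pred_conf[~obj_mask] - target_conf[~obj_mask]) ** 2).sum()
    print(loss_obj.item(), loss_noobj.item())   # the noobj term is far larger

    # Weighted version: down-weight empty boxes so they don't overpower the gradient.
    loss = loss_obj + lambda_noobj * loss_noobj

Even with random predictions, the no-object term is roughly two orders of magnitude larger than the object term before weighting, which is exactly the "overpowering" the paper describes.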
Upvotes: 0
Reputation: 91
There are grid cells that contain objects and cells that don't. The model is often very confident about the absence of an object (confidence near zero), while cells that do contain objects are predicted with less extreme confidence (say 0.7-0.8). Because the empty cells vastly outnumber the occupied ones, the gradient from their confidence loss becomes much greater than the gradient from the cells that do contain objects; it overpowers them. Since these confidence scores are so imbalanced (not very "fair"), we treat the confidence term as less important, and to implement this we give the coordinate predictions a greater weight than the confidence predictions.
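As a rough sketch, this is how the two lambda weights shift that balance in the sum-squared loss. It is a simplified version of the paper's loss, keeping only the coordinate and confidence terms (it ignores the sqrt(w), sqrt(h) trick and the class-probability term), and the function name and tensor shapes are assumptions for illustration:

    import torch

    lambda_coord, lambda_noobj = 5.0, 0.5  # values from the paper

    def yolo_conf_coord_loss(pred_xywh, target_xywh, pred_conf, target_conf, obj_mask):
        """Simplified YOLO v1 loss: coordinate + confidence terms only."""
        # Coordinate error, only for boxes responsible for an object, up-weighted by 5.
        coord_loss = lambda_coord * ((pred_xywh[obj_mask] - target_xywh[obj_mask]) ** 2).sum()
        # Confidence error for object boxes, at full weight.
        obj_conf_loss = ((pred_conf[obj_mask] - target_conf[obj_mask]) ** 2).sum()
        # Confidence error for empty boxes, down-weighted by 0.5 so its gradient
        # does not overpower the few boxes that contain objects.
        noobj_conf_loss = lambda_noobj * ((pred_conf[~obj_mask] - target_conf[~obj_mask]) ** 2).sum()
        return coord_loss + obj_conf_loss + noobj_conf_loss

    # Toy usage with random tensors: 507 boxes, 5 of them assigned to objects.
    N = 507
    obj_mask = torch.zeros(N, dtype=torch.bool); obj_mask[:5] = True
    loss = yolo_conf_coord_loss(torch.rand(N, 4), torch.rand(N, 4),
                                torch.rand(N), obj_mask.float(), obj_mask)

So the two weights pull in opposite directions: lambda(coord) = 5 amplifies the few localization terms, while lambda(noobj) = 0.5 dampens the many empty-box confidence terms.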
Upvotes: 2