Why does YOLO divide an image into grid cells?

I'm trying to understand how YOLO works for a project I'm doing. I've gone through the papers, many articles, and blog posts, but I'm still not sure why YOLO divides the entire image into a grid of cells and runs its computations per cell. What would happen if we treated the whole image as a single cell (without dividing it)? What purpose do the grid cells serve? Is there a maximum number of objects a particular cell can detect?

Upvotes: 1

Answers (1)

bozcani

Reputation: 19

Grid cells give the network's predictions a more structured form. Each grid cell corresponds to a specific region of the image, and each cell predicts the objects whose centers lie inside that region. So it is about having a structured output representation that takes advantage of the spatial regularity of images.

Each grid cell predicts a vector of the form [objectness_value, bbox_h, bbox_w, bbox_cx, bbox_cy, p1, p2, ..., pn].

objectness_value: how confident the network is that the predicted box contains an object
bbox_h, bbox_w, bbox_cx, bbox_cy: offsets for the bounding box height, width, center x-coordinate, and center y-coordinate, respectively
p1, p2, ..., pn: predicted class probabilities, one per object category (n categories in total)
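
To make the vector layout concrete, here is a minimal sketch of how one ground-truth object could be assigned to a grid cell and encoded. It is an illustration, not YOLO's actual code: the function name `encode_object`, the 7x7 grid, the 3 classes, and the choice to express the center as an offset inside the cell (with width and height normalized by the image size) are all assumptions made for the example.

```python
import numpy as np

def encode_object(cx, cy, w, h, class_id, S=7, num_classes=3):
    """Hypothetical helper: encode one object for an S x S grid.

    cx, cy, w, h are normalized by the image size (values in [0, 1)).
    Returns the (row, col) of the responsible cell and its target vector
    laid out as [objectness, bbox_h, bbox_w, bbox_cx, bbox_cy, p1..pn].
    """
    # The cell that contains the object's center is responsible for it.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)

    # Center expressed as an offset inside that cell, in cell units.
    cx_off = cx * S - col
    cy_off = cy * S - row

    vec = np.zeros(5 + num_classes, dtype=np.float32)
    vec[0] = 1.0                        # objectness: an object is present
    vec[1:5] = [h, w, cx_off, cy_off]   # box terms, in the order used above
    vec[5 + class_id] = 1.0             # one-hot class probability
    return row, col, vec

row, col, vec = encode_object(cx=0.5, cy=0.25, w=0.2, h=0.4, class_id=1)
print(row, col)   # grid cell that owns this object
print(vec)        # 8-element target vector for that cell
```

Every other cell in the grid would get an all-zero vector for this object, which is what makes the output "structured": the position of a vector in the grid already says roughly where in the image the object is.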

More grid cells mean more predictions. With a single grid cell (the whole image) and one box per cell, you would get only one bounding box prediction. That is not practical, because an image usually contains many objects.

Note that a grid cell can make multiple bounding box predictions by adding more sets of bbox offsets to its output vector.
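
As a rough illustration of how the grid size and the number of boxes per cell set the number of predictions, here is a small sketch. The helper names `output_shape` and `total_boxes` are made up for this example. The `shared_classes` flag is also an assumption of the sketch: it switches between one set of class probabilities per cell (the layout described in the original YOLO paper) and one set per box (the layout of the vector described above, repeated per box).

```python
def output_shape(S, B, C, shared_classes=True):
    """Shape of the prediction tensor for an S x S grid.

    S: grid size, B: boxes per cell, C: number of classes.
    Each box contributes 5 numbers (objectness + 4 box terms).
    """
    if shared_classes:
        per_cell = B * 5 + C      # class probabilities shared by the cell
    else:
        per_cell = B * (5 + C)    # class probabilities repeated per box
    return (S, S, per_cell)

def total_boxes(S, B):
    """Upper bound on the number of boxes predicted for one image."""
    return S * S * B

# Original YOLO settings on PASCAL VOC: 7x7 grid, 2 boxes per cell, 20 classes.
print(output_shape(7, 2, 20), total_boxes(7, 2))

# The "whole image as one cell" case from the question, with one box.
print(output_shape(1, 1, 20), total_boxes(1, 1))
```

With a 1x1 grid and one box, the network can output at most one detection per image, whereas the 7x7 grid with 2 boxes per cell allows up to 98 candidate boxes.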

Upvotes: 0
