Dominick Augustine

Reputation: 25

How are ground truth bounding boxes created for a deep learning training dataset?

I'm working on a project where I'd like to use Mask R-CNN to identify objects in a set of images. But I'm having a hard time understanding how bounding boxes (encoded pixels) are created for the ground truth data. Can anyone point me in the right direction or explain this to me further?

Upvotes: 1

Views: 2890

Answers (1)

afarley

Reputation: 815

Bounding boxes are typically labelled by hand. Most deep-learning practitioners use a separate application for tagging. I believe this package is popular:

https://github.com/AlexeyAB/Yolo_mark

I developed my own Ruby on Rails (RoR) solution for tagging, because it's helpful to distribute the work among several people. The repository is open-source if you want to take a look:

https://github.com/asfarley/imgclass

I think it's a bit misleading to call this 'encoded pixels'. A bounding box is a labelled rectangle: it is entirely defined by the object type (car, bus, truck) and the (x, y) coordinates of two opposite corners of the rectangle.
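As a concrete illustration, a bounding-box record can be sketched as a small data class holding a label plus corner coordinates. The field names here are my own choice, not taken from any particular tagging tool:

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    """A labelled rectangle: object type plus two opposite corners."""
    label: str   # object type, e.g. "car", "bus", "truck"
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def width(self) -> int:
        return self.x_max - self.x_min

    def height(self) -> int:
        return self.y_max - self.y_min

box = BoundingBox("car", 120, 80, 260, 190)
print(box.width(), box.height())  # 140 110
```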

The software for defining bounding-boxes generally consists of an image-display element, plus features that let the user drag bounding-boxes on the UI. My application uses a radio-button list to select the object type (car, bus, etc.); then the user draws a bounding-box.

The result of completely tagging an image is a text file, where each row represents a single bounding-box. You should check the library documentation for your training algorithm to understand exactly what format the bounding boxes must be in.
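For example, tools like Yolo_mark write one row per box in the YOLO text format, `class_id x_center y_center width height`, with coordinates normalized to [0, 1]. A minimal sketch of converting such a row back to pixel corner coordinates (other trainers expect different layouts, so treat this as one format among many):

```python
def parse_yolo_row(row: str, img_w: int, img_h: int):
    """Parse one YOLO-format annotation row into (class_id, corners).

    The row holds normalized center/size values; we scale by the image
    dimensions and convert to (x_min, y_min, x_max, y_max) corners.
    """
    class_id, xc, yc, w, h = row.split()
    xc, yc = float(xc) * img_w, float(yc) * img_h
    w, h = float(w) * img_w, float(h) * img_h
    x_min, y_min = xc - w / 2, yc - h / 2
    x_max, y_max = xc + w / 2, yc + h / 2
    return int(class_id), (x_min, y_min, x_max, y_max)

# A box centered in a 640x480 image, a quarter of the width and half the height:
print(parse_yolo_row("0 0.5 0.5 0.25 0.5", 640, 480))
# (0, (240.0, 120.0, 400.0, 360.0))
```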

In my own application, I've developed some features to compare bounding-boxes from different users. In any large ML effort, you will probably encounter some mis-labelled images, so you really need a tool to identify them, because mislabelled data can severely degrade your results.
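One common way to compare two annotators' boxes (not necessarily what the tool above does) is intersection-over-union: a low IoU between boxes that should match flags a possible labelling disagreement. A minimal sketch, assuming boxes are `(x_min, y_min, x_max, y_max)` tuples:

```python
def iou(a, b):
    """Intersection-over-union of two corner-format boxes."""
    # Overlap rectangle (may be empty, hence the max(0, ...) clamps).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

a = (0, 0, 100, 100)
b = (50, 0, 150, 100)
print(iou(a, b))  # overlap 50x100 = 5000, union = 15000, so ~0.333
```

A threshold on this score (e.g. flag pairs below 0.5) gives a simple automatic check for disagreement between labellers.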

Upvotes: 1
