I'm trying to understand how to find the location of the bounding box when an object is detected. I used the TensorFlow Object Detection API to detect a mouse in a box. Just to test how to retrieve the bounding box coordinates, I want to print "THIS IS A MOUSE" right above its head when the mouse is detected. However, my text currently prints several inches off from the mouse. For example, here is a screenshot from a video of my object detection.
Here is the relevant code snippet:
with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        start = time.time()
        while True:
            # Read frame from camera
            ret, image_np = cap.read()
            cv2.putText(image_np, "Time Elapsed: {}s".format(int(time.time() - start)),
                        (50, 50), cv2.FONT_HERSHEY_PLAIN, 3, (0, 0, 255), 3)
            # Expand dimensions since the model expects images to have shape: [1, None, None, 3]
            image_np_expanded = np.expand_dims(image_np, axis=0)
            # Extract image tensor
            image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
            # Extract detection boxes
            boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
            # Extract detection scores
            scores = detection_graph.get_tensor_by_name('detection_scores:0')
            # Extract detection classes
            classes = detection_graph.get_tensor_by_name('detection_classes:0')
            # Extract number of detections
            num_detections = detection_graph.get_tensor_by_name('num_detections:0')
            # Actual detection.
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: image_np_expanded})
            # Visualization of the results of a detection.
            vis_util.visualize_boxes_and_labels_on_image_array(
                image_np,
                np.squeeze(boxes),
                np.squeeze(classes).astype(np.int32),
                np.squeeze(scores),
                category_index,
                use_normalized_coordinates=True,
                line_thickness=8)
            for i, b in enumerate(boxes[0]):
                if classes[0][i] == 1:
                    if scores[0][i] >= .5:
                        mid_x = (boxes[0][i][3] + boxes[0][i][1]) / 2
                        mid_y = (boxes[0][i][2] + boxes[0][i][0]) / 2
                        cv2.putText(image_np, 'FOUND A MOUSE', (int(mid_x*600), int(mid_y*800)),
                                    cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)
            # Display output
            cv2.imshow(vid_name, cv2.resize(image_np, (800, 600)))
            # Write to output
            video_writer.write(image_np)
            if cv2.waitKey(25) & 0xFF == ord('q'):
                cv2.destroyAllWindows()
                break
cap.release()
cv2.destroyAllWindows()
It's not really clear to me how boxes works. Can someone explain this line to me: mid_x = (boxes[0][i][3] + boxes[0][i][1]) / 2? I understand that indices 1 and 3 represent x_min and x_max, but I'm not sure why I'm iterating through boxes[0] only and what i represents.
Solution: Just as ievbu suggested, I needed to convert the midpoint from its normalized values to pixel values for the frame. I used cv2.VideoCapture.get to retrieve the frame width and height and used those values to convert my midpoint to a pixel location.
# Query the capture device for the actual frame dimensions
frame_h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
frame_w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
...
# Scale the normalized midpoint by the frame size to get a pixel position
cv2.putText(image_np, '.', (int(mid_x*frame_w), int(mid_y*frame_h)),
            cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)
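For completeness, a minimal sketch of the fixed label placement inside the detection loop (same boxes, scores, and classes arrays as in the question; 'THIS IS A MOUSE' is just the label text I originally wanted):
for i, b in enumerate(boxes[0]):
    if classes[0][i] == 1 and scores[0][i] >= .5:
        # Boxes are ordered [ymin, xmin, ymax, xmax], normalized to [0, 1]
        mid_x = (boxes[0][i][1] + boxes[0][i][3]) / 2
        mid_y = (boxes[0][i][0] + boxes[0][i][2]) / 2
        # Scale by the real frame size instead of the hardcoded 600/800
        cv2.putText(image_np, 'THIS IS A MOUSE',
                    (int(mid_x * frame_w), int(mid_y * frame_h)),
                    cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)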
Answer: Boxes are returned with an extra dimension because you can pass in multiple images, and that dimension indexes each separate image (for a single input image you add it with np.expand_dims). You can see that for visualization it is removed with np.squeeze, and you can remove it manually just by taking boxes[0] if you process only one image. i is the index of a box in the boxes array; you need that index to access the class and score of the box you are analyzing.
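To illustrate, a small sketch (assuming a single input image, so the batch dimension is 1):
# boxes has shape (1, N, 4): one image, N detections, 4 coordinates per box
print(boxes.shape)
for i, box in enumerate(boxes[0]):          # boxes[0] drops the batch dimension
    ymin, xmin, ymax, xmax = box            # normalized to [0, 1]
    print(i, classes[0][i], scores[0][i], (ymin, xmin, ymax, xmax))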
The text is not in the correct position because the returned box coordinates are normalized, and you have to scale them to the full image size. Here is an example of how you can convert them (note that numpy's shape is (height, width, channels), and the boxes are ordered [ymin, xmin, ymax, xmax]):
(im_height, im_width, _) = frame.shape
(ymin, xmin, ymax, xmax) = box
(xmin, xmax, ymin, ymax) = (xmin * im_width, xmax * im_width,
                            ymin * im_height, ymax * im_height)
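Once converted, the pixel values can be passed straight to OpenCV's drawing functions, for example (a sketch, assuming frame and box from above):
# OpenCV expects integer pixel coordinates
cv2.rectangle(frame, (int(xmin), int(ymin)), (int(xmax), int(ymax)), (0, 255, 0), 2)
cv2.putText(frame, 'FOUND A MOUSE', (int((xmin + xmax) / 2), int(ymin)),
            cv2.FONT_HERSHEY_PLAIN, 2, (0, 255, 0), 3)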