Reputation: 91
I'm working on software that should do real-time people detection on multiple camera devices for a home surveillance system.
I'm currently using OpenCV to grab frames from an IP camera and TensorFlow to analyze them and find objects (the code is very similar to the example that ships with the TF Object Detection API). I've also tried different frozen inference graphs from the TensorFlow Object Detection API at this link:
I have a desktop PC with an Intel Core i7-6700 CPU @ 3.40GHz × 8 and an NVIDIA GeForce GTX 960 Ti GPU.
The software works as intended but is slower than expected (3-5 FPS), and CPU usage is quite high (80-90%) for a single Python script that works on only one camera device.
Am I doing something wrong? What are the best ways to optimize performance and achieve better FPS and lower CPU usage so I can analyze more video feeds at once? So far I've looked into multithreading, but I have no idea how to implement it in my code.
Code snippet:
with detection_graph.as_default():
    with tf.Session(graph=detection_graph) as sess:
        while True:
            # Grab a frame from the video stream and add a batch dimension.
            frame = cap.read()
            frame_expanded = np.expand_dims(frame, axis=0)
            # Input and output tensors of the detection graph.
            image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
            boxes = detection_graph.get_tensor_by_name("detection_boxes:0")
            scores = detection_graph.get_tensor_by_name("detection_scores:0")
            classes = detection_graph.get_tensor_by_name("detection_classes:0")
            num_detections = detection_graph.get_tensor_by_name("num_detections:0")
            # Run detection on the current frame.
            (boxes, scores, classes, num_detections) = sess.run(
                [boxes, scores, classes, num_detections],
                feed_dict={image_tensor: frame_expanded})
            vis_util.visualize_boxes_and_labels_on_image_array(frame, ...)
            cv2.imshow("video", frame)
            if cv2.waitKey(25) & 0xFF == ord("q"):
                cv2.destroyAllWindows()
                cap.stop()
                break
Upvotes: 2
Views: 2420
Reputation: 44
A few things I tried for my project may help.
Run
nvidia-smi -l 5
to monitor GPU usage and memory usage. Create a small buffer between OpenCV and TF so they don't compete for the same GPU resources:
BATCH_SIZE = 200
frameCount = 1
images = []
while (cap.isOpened() and frameCount <= 10000):
    ret, image_np = cap.read()
    if ret == True:
        frameCount = frameCount + 1
        images.append(image_np)
        if frameCount % BATCH_SIZE == 0:
            start = timer()
            output_dict_array = run_inference_for_images(images, detection_graph)
            end = timer()
            avg = (end - start) / len(images)
            print("TF inference took: " + str(end - start) + " for [" + str(len(images)) + "] images, average[" + str(avg) + "]")
            print("output array has:" + str(len(output_dict_array)))
            for idx in range(len(output_dict_array)):
                output_dict = output_dict_array[idx]
                image_np_org = images[idx]
                vis_util.visualize_boxes_and_labels_on_image_array(
                    image_np_org,
                    output_dict['detection_boxes'],
                    output_dict['detection_classes'],
                    output_dict['detection_scores'],
                    category_index,
                    instance_masks=output_dict.get('detection_masks'),
                    use_normalized_coordinates=True,
                    line_thickness=6)
                out.write(image_np_org)
                ##cv2.imshow('object image', image_np_org)
            del output_dict_array[:]
            del images[:]
    else:
        break
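The run_inference_for_images helper isn't shown above; it is essentially the Object Detection API tutorial's run_inference_for_single_image adapted to reuse one session across a whole list of frames. A minimal sketch of what it could look like (the exact signature and return format are assumptions inferred from how it is called above):

import numpy as np
import tensorflow as tf

def run_inference_for_images(images, graph):
    """Run the detection graph on a list of frames, reusing one session.

    Returns a list of output dicts, one per input frame, in the format
    expected by vis_util.visualize_boxes_and_labels_on_image_array.
    """
    output_dict_array = []
    with tf.Session(graph=graph) as sess:
        # Look the tensors up once, outside the per-frame loop.
        image_tensor = graph.get_tensor_by_name('image_tensor:0')
        tensor_dict = {
            'detection_boxes': graph.get_tensor_by_name('detection_boxes:0'),
            'detection_scores': graph.get_tensor_by_name('detection_scores:0'),
            'detection_classes': graph.get_tensor_by_name('detection_classes:0'),
            'num_detections': graph.get_tensor_by_name('num_detections:0'),
        }
        for image_np in images:
            output_dict = sess.run(
                tensor_dict,
                feed_dict={image_tensor: np.expand_dims(image_np, axis=0)})
            # Strip the batch dimension and cast classes to ints so the
            # category_index lookup in vis_util works.
            output_dict['num_detections'] = int(output_dict['num_detections'][0])
            output_dict['detection_boxes'] = output_dict['detection_boxes'][0]
            output_dict['detection_scores'] = output_dict['detection_scores'][0]
            output_dict['detection_classes'] = output_dict['detection_classes'][0].astype(np.int64)
            output_dict_array.append(output_dict)
    return output_dict_array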
Use MobileNet models.
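An SSD-MobileNet frozen graph from the model zoo loads the same way as any other frozen graph; a short sketch (the model directory name below is just an example of a model-zoo download, adjust the path to your own):

import tensorflow as tf

# Path to the extracted frozen graph; adjust to wherever you unpacked the model.
PATH_TO_FROZEN_GRAPH = 'ssd_mobilenet_v1_coco_2017_11_17/frozen_inference_graph.pb'

detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, 'rb') as fid:
        od_graph_def.ParseFromString(fid.read())
        tf.import_graph_def(od_graph_def, name='')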
Resize the capture to 1280 × 720, save the capture as a file, and run inference on the file.
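A small sketch of the resize-and-save step with plain OpenCV (the camera URL, codec, and file name are placeholders):

import cv2

cap = cv2.VideoCapture('rtsp://<camera-address>')  # placeholder camera URL
fourcc = cv2.VideoWriter_fourcc(*'XVID')
out = cv2.VideoWriter('capture_720p.avi', fourcc, 20.0, (1280, 720))

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    # Downscale before writing so inference later runs on smaller frames.
    frame = cv2.resize(frame, (1280, 720))
    out.write(frame)

cap.release()
out.release()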
I did all of the above and achieved 12-16 FPS on a GTX 1060 (6 GB) laptop:
2018-06-04 13:27:03.381783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-06-04 13:27:03.381854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-06-04 13:27:03.381895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-06-04 13:27:03.381933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-06-04 13:27:03.382069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5211 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
===TF inference took: 8.62651109695 for [100] images, average[0.0862651109695]===
Upvotes: 1