Carlo

Reputation: 1578

GluonCV - Object detection, set mx.ctx to GPU, but still using all CPU cores

I’m running an object detection routine on a server.
I set the context to the GPU, and I'm loading the model, the parameters, and the data onto the GPU. The program reads from a video file or from an RTSP stream using OpenCV.

Looking at nvidia-smi, I see that the selected GPU's utilization is at 20%, which is reasonable. However, the object detection routine is still using 750-1200% CPU (basically, all of the available cores of the server).

This is the code:

import cv2
import mxnet as mx
import gluoncv as gcv


def main():

    ctx = mx.gpu(3)

    # -------------------------
    # Load a pretrained model
    # -------------------------
    net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True, ctx=ctx)

    # Load the webcam handler
    cap = cv2.VideoCapture("video/video_01.mp4")

    count_frame = 0
    while True:
        print(f"Frame: {count_frame}")

        # Load frame from the camera
        ret, frame = cap.read()


        if (cv2.waitKey(25) & 0xFF == ord('q')) or not ret:
            cv2.destroyAllWindows()
            cap.release()
            print("Done!!!")
            break

        # Image pre-processing
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
        if isinstance(frame_nd, mx.nd.NDArray):
            frame_nd.wait_to_read()

        # Run frame through network
        frame_nd = frame_nd.as_in_context(ctx)
        class_IDs, scores, bounding_boxes = net(frame_nd)
        if isinstance(class_IDs, mx.nd.NDArray):
            class_IDs.wait_to_read()
        if isinstance(scores, mx.nd.NDArray):
            scores.wait_to_read()
        if isinstance(bounding_boxes, mx.nd.NDArray):
            bounding_boxes.wait_to_read()


        count_frame += 1



    cv2.destroyAllWindows()
    cap.release()


if __name__ == "__main__":
    main()

This is the output of nvidia-smi: [screenshot: GPU usage]

while this is the output of top: [screenshot: CPU usage]

The pre-processing operations are running on the CPU:

frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)

but is that enough to justify such high CPU usage? And if so, can I run them on the GPU as well?

EDIT: I modified the code and copied it here in full, in response to Olivier_Cruchant's comment (thanks!)

Upvotes: 1

Views: 241

Answers (1)

Olivier Cruchant

Reputation: 4037

Your CPU is likely busy because of the pre-processing load and the frequent back-and-forth between host memory and the GPU, since inference seems to be running frame by frame. I would suggest trying the following:

  1. Run batched inference (send a batch of N frames to the network) to increase GPU usage and reduce host-to-device communication; see the sketch after this list.
  2. Try using NVIDIA DALI to make better use of the GPU for data ingestion and pre-processing (DALI MXNet reference, DALI mp4 ingestion PyTorch example).
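
A minimal sketch of idea 1 (untested; the batch size is arbitrary, and the model name and video path are just taken from your question). It assumes all frames come from the same video, so the tensors returned by transform_test share a shape and can be concatenated along the batch dimension:

import cv2
import mxnet as mx
import gluoncv as gcv

BATCH_SIZE = 8  # arbitrary; tune for your GPU memory
ctx = mx.gpu(3)

net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True, ctx=ctx)
cap = cv2.VideoCapture("video/video_01.mp4")

buffer = []
while True:
    ret, frame = cap.read()
    if not ret:
        break

    # Pre-processing still runs on the CPU, but the host-to-device copy
    # and the blocking synchronization now happen once per batch.
    frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
    frame_nd, _ = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
    buffer.append(frame_nd)

    if len(buffer) == BATCH_SIZE:
        batch = mx.nd.concat(*buffer, dim=0).as_in_context(ctx)  # one copy to GPU
        class_IDs, scores, bounding_boxes = net(batch)           # one forward pass for N frames
        mx.nd.waitall()                                          # block once per batch
        buffer = []

cap.release()

Each output then has a leading batch dimension, so you can slice class_IDs[i], scores[i] and bounding_boxes[i] per frame for post-processing. Batching trades a little latency for fewer synchronization points and better GPU utilization.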

Upvotes: 1
