Reputation: 499
I have trained a faster_rcnn_inception_resnet_v2_atrous_coco
model (available here) for custom object detection.
For prediction, I used the object detection demo Jupyter notebook on my images. I also checked the time consumed at each step and found that sess.run
was taking almost all of the time.
But it takes around 25-40 [sec] to predict one image of (3000 x 2000) pixel size (around 1-2 [MB]) on the GPU.
Can anyone figure out the problem here?
I have performed profiling; link to download the profiling file.
Link to full profiling
System information:
Training and prediction on a virtual machine created in the Azure portal with size Standard_NV6 (details here), which uses an NVIDIA Tesla M60 GPU
pip3 install --upgrade tensorflow-gpu
Upvotes: 6
Views: 1158
Reputation: 1
Can anyone figure out the problem here?
One could hardly find a worse VM setup in the Azure portfolio for such a compute-intensive (performance-and-throughput motivated) task. There simply is no less-equipped option for this on the menu.
The Azure NV6 is explicitly marketed for the benefit of Virtual Desktop users, where the NVIDIA GRID(R) driver delivers a software layer of services for "sharing" parts of an also-virtualised frame buffer for image/video (desktop graphics pixels, at most SP encode/decode), shared among teams of users irrespective of their terminal device (up to 15 users per each of the two on-board GPUs, which is exactly what Azure explicitly advertises and promotes as its key selling point; NVIDIA goes even a step further, promoting this device explicitly for (cit.) office users).
The M60 lacks (obviously, as it was designed for a very different market segment) any smart AI / ML / DL / tensor-processing features, having roughly 20x lower DP performance than GPU devices specialised for AI / ML / DL / tensor-processing computing.
If I may cite,
... "GRID" is the software component that lays over a given set of Tesla ( Currently M10, M6, M60 ) (and previously Quadro (K1 / K2)) GPUs. In its most basic form (if you can call it that), the GRID software is currently for creating FrameBuffer profiles when using the GPUs in "Graphics" mode, which allows users to share a portion of the GPUs FrameBuffer whilst accessing the same physical GPU.
and
No, the M10, M6 and M60 are not specifically suited for AI. However, they will work, just not as efficiently as other GPUs. NVIDIA creates specific GPUs for specific workloads and industry (technological) areas of use, as each area has different requirements.( credits go to BJones )
Next,
if you are indeed willing to spend effort on this a-priori known worst option à la carte:
make sure that both GPUs are in "Compute" mode, NOT "Graphics" if you're playing with AI. You can do that using the Linux Boot Utility you'll get with the correct M60 driver package after you've registered for the evaluation. ( credits go again to BJones )
which obviously does not seem to be available for non-Linux / Azure-operated virtualised-access devices.
If striving for increased performance and throughput, best choose another, AI / ML / DL / tensor-processing equipped GPU device, where the problem-specific computing hardware is actually present and where there are no software layers (no GRID, or at least one with an easily available disable option) that would in any sense block you from achieving such advanced levels of GPU-processing performance.
Upvotes: 2
Reputation: 3021
TensorFlow takes a long time for the initial setup. (Don't worry, it is just a one-time process.)
Loading the graph is a heavy process. I executed this code on my CPU, and it took almost 40 seconds to complete the program.
The time taken for initial set up like loading the graph was 37 seconds.
The actual time taken for performing object detection was 3 seconds, i.e. 1.5 seconds per image.
If I had given it 100 images, the total time taken would be 37 + 1.5 * 100 seconds, since I don't have to load the graph 100 times.
So in your case, if that took 25 [s], then the initial setup would have taken ~ 23-24 [s]. The actual time should be ~ 1-2 [s].
You can verify this in the code, using the time
module in Python:
import time                                 # used to obtain time stamps

for image_path in TEST_IMAGE_PATHS:         # iterate over the images to detect
    # -------------------------------------- processing one image begins here
    start = time.time()                     # save the current timestamp
    ...
    ...
    ...
    plt.imshow( image_np )
    # -------------------------------------- processing one image ends here
    print( 'Time taken',
            time.time() - start             # the time this image has taken
            )
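To separate the one-time setup cost from the per-image cost, you can time the two phases independently. Below is a minimal sketch, under the assumption that you use the demo notebook's variable names (PATH_TO_CKPT, TEST_IMAGE_PATHS) and reuse a single session for all images:

import time
import numpy as np
import tensorflow as tf
from PIL import Image                        # the demo notebook loads images with PIL

# ---- one-time setup: load the frozen graph and create the session ----------
setup_start     = time.time()
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile( PATH_TO_CKPT, 'rb' ) as fid:     # frozen model file
        od_graph_def.ParseFromString( fid.read() )
        tf.import_graph_def( od_graph_def, name = '' )
sess = tf.Session( graph = detection_graph )
print( 'Setup time:', time.time() - setup_start )

# ---- per-image inference: only this part repeats for every image -----------
image_tensor   = detection_graph.get_tensor_by_name( 'image_tensor:0' )
output_tensors = [ detection_graph.get_tensor_by_name( name + ':0' )
                   for name in ( 'detection_boxes',   'detection_scores',
                                 'detection_classes', 'num_detections' ) ]

for image_path in TEST_IMAGE_PATHS:
    image_np = np.array( Image.open( image_path ) )       # load one test image
    start    = time.time()
    sess.run( output_tensors,
              feed_dict = { image_tensor: np.expand_dims( image_np, axis = 0 ) } )
    print( 'Inference time for', image_path, ':', time.time() - start )

Note that the very first sess.run call is typically slower than the following ones, since TensorFlow still allocates GPU memory and optimises the graph on first use, so time at least two images before drawing conclusions.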
Upvotes: 1
Reputation: 567
As the website says, the image size should be 600x600 and the reported timings were measured on an NVIDIA GeForce GTX TITAN X card. But first, please make sure your code is actually running on the GPU and not on the CPU. I suggest running your code and, in another window, watching the GPU utilization with the command below to see whether anything changes.
watch nvidia-smi
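As a quick sanity check from inside TensorFlow itself, you can also list the devices it sees and enable device-placement logging. A minimal sketch for the TF 1.x API that the demo notebook uses:

import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see; a '/device:GPU:0' entry should appear
# if the tensorflow-gpu build and the NVIDIA driver are picked up correctly.
print( device_lib.list_local_devices() )

# Optionally, log which device each op is actually placed on when the session runs.
config = tf.ConfigProto( log_device_placement = True )
sess   = tf.Session( config = config )

If nvidia-smi shows 0% utilization while the script is running, the session is most likely falling back to the CPU.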
Upvotes: 1
Reputation: 3021
It is natural that big images take more time. TensorFlow object detection performs well even at lower resolutions like 400*400.
Take a copy of the original image and resize it to a lower resolution to perform object detection. You will get bounding box coordinates. Now calculate the corresponding bounding box coordinates for your original higher-resolution image, and draw the bounding boxes on the original image.
i.e.
Imagine you have an image of 3000*2000. Make a copy of it and resize it to 300*200. Object detection on the resized image detects an object with bounding box (50, 100, 150, 150), i.e. (ymin, xmin, ymax, xmax).
Since each dimension of the original is 10x larger, the corresponding box coordinates for the larger original image will be (500, 1000, 1500, 1500). Draw the rectangle on it.
Perform detection on the small image, then draw the bounding box on the original image. Performance will improve tremendously.
Note: TensorFlow supports normalized coordinates,
i.e. if you have an image with height 100 and ymin = 50, then the normalized ymin is 0.5. You can map normalized coordinates to an image of any dimension simply by multiplying by the height or width for the y and x coordinates respectively.
I suggest using OpenCV (cv2) for all image processing.
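A minimal sketch of this resize-then-rescale workflow, assuming the demo notebook's detection_graph and sess are already loaded; the helper name and the 0.5 score threshold below are just illustrative:

import cv2
import numpy as np

def detect_on_small_copy( sess, detection_graph, image_path, small_size = (400, 400) ):
    # Read the full-resolution image and keep its dimensions for the later mapping.
    original       = cv2.imread( image_path )             # BGR, shape (H, W, 3)
    orig_h, orig_w = original.shape[:2]

    # Run detection on a small copy only.
    small     = cv2.resize( original, small_size )
    small_rgb = cv2.cvtColor( small, cv2.COLOR_BGR2RGB )

    image_tensor  = detection_graph.get_tensor_by_name( 'image_tensor:0' )
    boxes_tensor  = detection_graph.get_tensor_by_name( 'detection_boxes:0' )
    scores_tensor = detection_graph.get_tensor_by_name( 'detection_scores:0' )

    boxes, scores = sess.run( [ boxes_tensor, scores_tensor ],
                              feed_dict = { image_tensor: np.expand_dims( small_rgb, axis = 0 ) } )

    # The boxes come back normalized as (ymin, xmin, ymax, xmax), so scaling them
    # to the original image is just a multiplication by its height and width.
    for box, score in zip( boxes[0], scores[0] ):
        if score < 0.5:                                    # hypothetical threshold
            continue
        ymin, xmin, ymax, xmax = box
        top_left     = ( int( xmin * orig_w ), int( ymin * orig_h ) )
        bottom_right = ( int( xmax * orig_w ), int( ymax * orig_h ) )
        cv2.rectangle( original, top_left, bottom_right, (0, 255, 0), 3 )

    return original

Because the returned boxes are normalized, the same mapping works for any small_size you choose; the resize only affects detection quality, not the coordinate conversion.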
Upvotes: 0