Reputation: 2688

TensorFlow Object Detection API: training is very slow

I am currently studying Google's TensorFlow Object Detection API. When I try to retrain a model on the Oxford-IIIT Pet dataset, the training process is very slow.

Here is what I have found so far:

Most of the time, only 2% of the GPU is utilized.
CPU utilization is 60%, so it seems the GPU is not starved for input; otherwise CPU utilization should be near 100%.

I am trying to profile it with the TensorFlow profiler, but I am in a bit of a hurry, so any ideas or suggestions would be helpful.

Upvotes: 1

Answers (3)

Vedanshu

Reputation: 2296

There are many possible reasons for this. The most common is a problem with your record file. Some checks need to be done before adding an image and its contour to the record file.

First, check that the image decodes before writing it to the record file:

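import tensorflow as tf  # TensorFlow 1.x API (tf.read_file, tf.Session)

def checkJPG(fn):
    """Return True if fn can be read and decoded as a JPEG, else False."""
    with tf.Graph().as_default():
        try:
            image_contents = tf.read_file(fn)
            image = tf.image.decode_jpeg(image_contents, channels=3)
            with tf.Session() as sess:
                # Running the op forces the actual file read and decode
                sess.run(image)
        except Exception:
            # Unreadable files and invalid JPEG data both end up here
            print("Corrupted file: ", fn)
            return False
    return True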

Also check the width and height of each contour, and make sure no contour touches or crosses the image borders:

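# Runs inside the loop over the contours of one image;
# width and height are the image dimensions in pixels.
boxW = xmax - xmin
boxH = ymax - ymin
if boxW == 0 or boxH == 0:
    print("...ONE CONTOUR SKIPPED... (boxW | boxH) = 0")
    continue

# Skip contours whose area is below 100 square pixels
if boxW*boxH < 100:
    print("...ONE CONTOUR SKIPPED... (boxW*boxH) < 100")
    continue

# Skip contours that touch or cross the image border. Comparing pixel
# values directly avoids integer division (xmin / width) under Python 2.
if xmin <= 0 or xmax <= 0 or ymin <= 0 or ymax <= 0:
    print("...ONE CONTOUR SKIPPED... (x | y) <= 0")
    continue
if xmin >= width or xmax >= width or ymin >= height or ymax >= height:
    print("...ONE CONTOUR SKIPPED... (x | y) >= 1")
    continue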
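
The contour checks above can also be wrapped in a small function so they are easy to test on their own. This is only a sketch: the name is_valid_box and the min_area parameter are my own, not part of the Object Detection API. It compares pixel coordinates directly, so the result does not depend on whether `/` is integer or true division, and it additionally rejects inverted boxes (xmax < xmin), which is my addition.

```python
def is_valid_box(xmin, ymin, xmax, ymax, width, height, min_area=100):
    """Return True if a box passes the size and border checks.

    Hypothetical helper: coordinates and image size are in pixels.
    """
    box_w = xmax - xmin
    box_h = ymax - ymin
    # Reject empty boxes; "<=" also rejects inverted ones (an assumption
    # beyond the original "== 0" check).
    if box_w <= 0 or box_h <= 0:
        return False
    # Reject boxes smaller than min_area square pixels
    if box_w * box_h < min_area:
        return False
    # Reject boxes that touch or cross the image border
    return 0 < xmin and xmax < width and 0 < ymin and ymax < height
```

For example, in a 100x100 image, is_valid_box(10, 10, 50, 50, 100, 100) passes, while a box starting at xmin = 0 is rejected because it touches the left border.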

Another possible reason is that there is too much data in the evaluation record file. It is better to put only 10 images in your evaluation record file and change the evaluation config like this:

eval_config {
  num_visualizations: 10
  num_examples: 10
  eval_interval_secs: 3000
  max_evals: 1
  use_moving_averages: false
}

Upvotes: 1

scott huang

Reputation: 2688

I found the problem. It was an issue with the input: my tfrecord file was corrupted somehow, so the input thread sometimes hung.
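
For anyone who wants to check a record file for this kind of damage without starting a training run, here is a minimal sketch in plain Python. It assumes the TFRecord framing as I understand it from the TensorFlow documentation: each record is an 8-byte little-endian length, a 4-byte masked CRC-32C of the length, the data, and a 4-byte masked CRC-32C of the data. The function names (crc32c, masked_crc, scan_tfrecord) are my own and not part of any TensorFlow API.

```python
import struct

# CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78
_CRC_TABLE = []
for _i in range(256):
    _crc = _i
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _CRC_TABLE.append(_crc)


def crc32c(data):
    """Plain CRC-32C of a bytes object."""
    crc = 0xFFFFFFFF
    for byte in bytearray(data):
        crc = _CRC_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data):
    """CRC-32C with the rotate-and-add mask assumed for TFRecord files."""
    crc = crc32c(data)
    rotated = ((crc >> 15) | (crc << 17)) & 0xFFFFFFFF
    return (rotated + 0xA282EAD8) & 0xFFFFFFFF


def scan_tfrecord(path):
    """Return (good_record_count, error_message_or_None) for one file."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(12)
            if not header:
                return count, None  # clean end of file
            if len(header) < 12:
                return count, "truncated header after record %d" % count
            length, length_crc = struct.unpack("<QI", header)
            if masked_crc(header[:8]) != length_crc:
                return count, "bad length checksum after record %d" % count
            body = f.read(length + 4)
            if len(body) < length + 4:
                return count, "truncated data after record %d" % count
            (data_crc,) = struct.unpack("<I", body[length:])
            if masked_crc(body[:length]) != data_crc:
                return count, "bad data checksum after record %d" % count
            count += 1
```

Calling scan_tfrecord on a record file (for example a hypothetical pet_train.record) returns how many records were read cleanly and a message describing the first problem, or None if the whole file checks out. It only validates the framing and checksums; it does not parse the examples inside each record.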

Upvotes: 1

Imran Ahmad Ghazali

Reputation: 625

As far as I can see, it is not utilizing the GPU right now. Have you tried optimizing for the GPU using the parameters described in the TensorFlow performance guide?

https://www.tensorflow.org/performance/performance_guide#optimizing_for_gpu

Upvotes: 0
