Tensorflow object detection API - validation loss behaviour

Question

I am trying to use the TensorFlow object detection API to recognize a specific object (guitars) in pictures and videos.

As for the data, I downloaded the images from the OpenImage dataset, and derived the .tfrecord files. I am testing with different numbers, but for now let's say I have 200 images in the training set and 100 in the evaluation one.

I'm traininig the model using the "ssd_mobilenet_v1_coco" as a starting point, and the "model_main.py" script, so that I can have training and validation results.

When I visualize the training progress in TensorBoard, I get the following results for train:

and validation loss:

respectively.

I am generally new to computer vision and trying to learn, so I was trying to figure out the meaning of these plots. The training loss goes as expected, decreasing over time. In my (probably simplistic) view, I was expecting the validation loss to start at high values, decrease as training goes on, and then start increasing again if the training goes on for too long and the model starts overfitting.

But in my case, I don't see this behavior for the validation curve, which seems to be trending upwards basically all the time (excluding fluctuations).

Have I been training the model for too little time to see the behavior I'm expecting? Are my expectations wrong in the first place? Am I misinterpreting the curves?

Carlo · Accepted Answer

Ok, I fixed it by decreasing the initial_learning_rate from 0.004 to 0.0001.
It was the obvious solution, considering the wild oscillations of the validation loss, but at first I thought it wouldn't work since there seems to be already a learning rate scheduler in the config file.
However, immediately below (in the config file) there's a num_steps option, and it's stated that

  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.

Honestly, I don't remember if I commented out the num_steps option...if I didn't, it seems my learning rate was kept to the initial value of 0.004, which turned out to be too high.
If I did comment it out (so that the learning scheduler was active), I guess that, instead of the decrease, it still started from too high of a value.
Anyway, it's working much better now, I hope this can be useful if anyone is experiencing the same problem.

Tensorflow object detection API - validation loss behaviour

Answers (1)

Related Questions