chris

Reputation: 4996

TensorFlow Object Detection API: Faster R-CNN randomly fails

I am trying to use the new Object Detection API in TensorFlow 1.2 with the example Faster R-CNN config to train on a custom dataset. The error I get is related to tensor shapes, but it occurs seemingly at random during training, and the exact shapes in the message change too.

INFO:tensorflow:global step 132: loss = 63.3741 (0.262 sec/step)
INFO:tensorflow:global step 133: loss = 33.7362 (0.292 sec/step)
INFO:tensorflow:global step 134: loss = 18.0165 (0.264 sec/step)
INFO:tensorflow:global step 135: loss = 40.5577 (0.266 sec/step)
INFO:tensorflow:global step 136: loss = 24.1086 (0.266 sec/step)
2017-07-10 10:23:49.066345: W tensorflow/core/framework/op_kernel.cc:1165] Invalid argument: Incompatible shapes: [1,60,4] vs. [1,64,4]
     [[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
2017-07-10 10:23:49.066475: W tensorflow/core/framework/op_kernel.cc:1165] Invalid argument: Incompatible shapes: [1,60,4] vs. [1,64,4]
     [[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
2017-07-10 10:23:49.066509: W tensorflow/core/framework/op_kernel.cc:1165] Invalid argument: Incompatible shapes: [1,60,4] vs. [1,64,4]
     [[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InvalidArgumentError'>, Incompatible shapes: [1,60,4] vs. [1,64,4]
     [[Node: gradients/Loss/BoxClassifierLoss/Loss/sub_grad/BroadcastGradientArgs = BroadcastGradientArgs[T=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape, gradients/Loss/BoxClassifierLoss/Loss/sub_grad/Shape_1)]]
     [[Node: gradients/FirstStageFeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/convolution_grad/tuple/control_dependency_1/_2621 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_13108_gradients/FirstStageFeatureExtractor/resnet_v1_50/resnet_v1_50/block1/unit_1/bottleneck_v1/conv3/convolution_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]

As you can see, training runs correctly for a variable number of steps and then fails with Invalid argument: Incompatible shapes: [1,60,4] vs. [1,64,4]. What I don't understand is why this error is triggered and where the incompatible shape comes from, since it also changes between runs.

Since I converted my dataset into the TF format, I was unsure whether that was the issue. However, I have successfully trained for several days on the same dataset with their SSD implementation, so I think it is safe to say the data is formatted correctly.

EDIT: The label map file is here. Again, I would like to point out that this same dataset runs perfectly using SSD.

Upvotes: 2

Views: 1747

Answers (4)

Arun VM

Reputation: 71

You are reading your sequence examples from tf.train.batch with allow_smaller_final_batch=True. The error is likely caused by the smaller final batch, whose batch size produces the incompatible shapes.

Upvotes: 0

Dileep

Reputation: 128

You have to set num_classes to the number of classes in your dataset in the faster_rcnn_resnet101.config file.
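
One way to check this is to compare the num_classes value in the config against the number of item entries in the label map. The snippet below is only a rough sketch: it parses both files as plain text with regular expressions rather than through the Object Detection API, the helper names read_num_classes and count_label_map_items are made up for illustration, and the sample strings are hypothetical stand-ins for your real files.

```python
import re


def read_num_classes(config_text):
    """Return the first `num_classes: N` value in a pipeline config, or None."""
    match = re.search(r"^[ \t]*num_classes[ \t]*:[ \t]*(\d+)",
                      config_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def count_label_map_items(label_map_text):
    """Count the `item {` blocks in a label map."""
    return len(re.findall(r"^[ \t]*item[ \t]*\{",
                          label_map_text, flags=re.MULTILINE))


# Hypothetical excerpts; in practice, read the real files from disk.
sample_config = """
model {
  faster_rcnn {
    num_classes: 1
  }
}
"""

sample_label_map = """
item {
  id: 1
  name: 'balloon'
}
"""

num_classes = read_num_classes(sample_config)
num_items = count_label_map_items(sample_label_map)
if num_classes != num_items:
    print("Mismatch: num_classes=%s, label map items=%s" % (num_classes, num_items))
else:
    print("num_classes matches the label map: %s" % num_classes)
```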

Upvotes: 0

Frédéric

Reputation: 77

You can try to start your class id from 1 instead of 0.

item {
  id: 1
  name: 'balloon'
}

It worked for me.
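
If you want to check an existing label map quickly, something like the following may help. It is a minimal sketch that scans the label map as plain text for `id: N` lines instead of using the Object Detection API's own parser; the function names find_label_map_ids and ids_start_at_one are hypothetical.

```python
import re


def find_label_map_ids(label_map_text):
    """Return the integer ids from the `id: N` lines, in file order."""
    pattern = r"^[ \t]*id[ \t]*:[ \t]*(\d+)[ \t]*$"
    return [int(value)
            for value in re.findall(pattern, label_map_text, flags=re.MULTILINE)]


def ids_start_at_one(label_map_text):
    """True if the label map has at least one id and none is below 1."""
    ids = find_label_map_ids(label_map_text)
    return bool(ids) and min(ids) >= 1


label_map = """
item {
  id: 1
  name: 'balloon'
}
"""

if ids_start_at_one(label_map):
    print("Label map ids look fine:", find_label_map_ids(label_map))
else:
    print("Label map has no ids or uses id 0:", find_label_map_ids(label_map))
```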

Upvotes: 0

Jonathan Huang

Reputation: 1558

The TensorFlow Object Detection API assumes that the '0' label is reserved for 'none_of_the_above', so one immediate thing to do is to add 1 to every label index in your label map.

It's unclear why things fail (in a hard way) for Faster R-CNN and not for SSD (probably something for us to dig into), but I'd be a bit surprised if you got very good results with SSD using that label map.
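
As a rough illustration of "add 1 to every label index", the sketch below rewrites the `id: N` lines of a label map as plain text. The function name shift_label_map_ids and the 'cat'/'dog' entries are hypothetical, and this is an assumption about how one might do it, not part of the API. The class labels written into the training records would presumably need the same shift so that they still agree with the label map.

```python
import re


def shift_label_map_ids(label_map_text, offset=1):
    """Return the label map text with every `id: N` rewritten to `id: N + offset`."""
    def bump(match):
        return match.group(1) + str(int(match.group(2)) + offset)

    return re.sub(r"^([ \t]*id[ \t]*:[ \t]*)(\d+)", bump,
                  label_map_text, flags=re.MULTILINE)


# Hypothetical zero-based label map.
zero_based = """item {
  id: 0
  name: 'cat'
}
item {
  id: 1
  name: 'dog'
}
"""

one_based = shift_label_map_ids(zero_based)
print(one_based)
```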

Upvotes: 1
