Reputation: 363
Final objective: Object Midpoint calculation.
I have a small dataset (around 120 images), each containing the same object, and the labels are the normalized x, y coordinates of the object's midpoint in the image (always between 0 and 1).
e.g. x = image_005 ; y = (0.1, 0.15) for an image with the object placed near the bottom left corner
I am trying to use a ResNet architecture customized for my image size (all images have the same dimensions). Since the output values are always between 0 and 1 for both coordinates, I was wondering whether I can use a sigmoid activation in my last layer:
X = Dense(2, activation='sigmoid', name='fc', kernel_initializer=glorot_uniform(seed=0))(X)
instead of a linear activation (as is often advised when you are trying to achieve a regression result).
For the loss function I use MSE with the 'rmsprop' optimizer, and in addition to accuracy and MSE I have written a custom metric that tells me whether the predicted points are off from the labels by more than 5%:
model.compile(optimizer='rmsprop', loss='mean_squared_error', metrics=['mse','acc',perc_midpoint_err])
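A minimal sketch of such a metric (the exact implementation isn't shown here; this assumes "off by more than 5%" means a Euclidean distance greater than 0.05 in normalized coordinates):

from tensorflow.keras import backend as K

def perc_midpoint_err(y_true, y_pred):
    # Fraction of predictions whose Euclidean distance from the label
    # exceeds 0.05 in normalized coordinates ("off by more than 5%").
    dist = K.sqrt(K.sum(K.square(y_true - y_pred), axis=-1))
    return K.mean(K.cast(dist > 0.05, 'float32'))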
I am not getting good results after training the model for around 150 epochs (I experimented with different batch sizes too).
Should I change the activation layer to linear? Or is there a different modification I can do to my model? Or is ResNet completely unsuitable for this task?
Upvotes: 1
Views: 2938
Reputation: 1633
Apart from what you have done, there are lots of other things you can do. Here is a simple implementation of a ResNet:
from tensorflow import keras
from tensorflow.keras.layers import (Input, Conv2D, MaxPooling2D, BatchNormalization,
                                     Activation, Dropout, AveragePooling2D, Flatten,
                                     Dense, add)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

def unit(x, filters, pool=False):
    # Residual unit: two 3x3 convolutions plus a shortcut connection.
    res = x
    if pool:
        # Halve the spatial resolution on the main path, and match it on the
        # shortcut with a strided 1x1 convolution (which also adjusts channels).
        x = MaxPooling2D(pool_size=(2, 2))(x)
        res = Conv2D(filters=filters, kernel_size=(1, 1), strides=(2, 2),
                     padding='same', kernel_initializer='he_normal')(res)
    out = BatchNormalization()(x)
    out = Activation('relu')(out)
    out = Conv2D(filters=filters, kernel_size=(3, 3), strides=(1, 1),
                 padding='same', kernel_initializer='he_normal')(out)
    out = BatchNormalization()(out)
    out = Activation('relu')(out)
    out = Conv2D(filters=filters, kernel_size=(3, 3), strides=(1, 1),
                 padding='same', kernel_initializer='he_normal')(out)
    x = add([res, out])  # shortcut: add the residual to the block output
    return x

def model(inputs):
    inp = Input(inputs)
    x = Conv2D(32, (3, 3), padding='same', kernel_initializer='he_uniform')(inp)
    x = unit(x, 32)
    x = unit(x, 32)
    x = unit(x, 32)
    x = unit(x, 64, pool=True)
    x = unit(x, 64)
    x = unit(x, 64)
    x = unit(x, 128, pool=True)
    x = unit(x, 128)
    x = unit(x, 128)
    x = unit(x, 256, pool=True)
    x = unit(x, 256)
    x = unit(x, 256)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Dropout(0.25)(x)
    x = AveragePooling2D((3, 3))(x)
    x = Flatten()(x)
    # Two sigmoid outputs for the normalized (x, y) midpoint coordinates.
    x = Dense(2, activation='sigmoid')(x)
    model = Model(inputs=inp, outputs=x)
    optimizer = Adam(learning_rate=0.001)
    # For classification you would use cross-entropy instead:
    # model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])
    model.compile(optimizer=optimizer,
                  loss=keras.losses.MeanSquaredLogarithmicError(),
                  metrics=['accuracy'])
    return model
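A hypothetical way to build and train it (the 128x128 RGB input shape and the train_images / train_midpoints arrays are assumptions, since the question doesn't give the image dimensions):

net = model((128, 128, 3))
net.summary()
net.fit(train_images, train_midpoints, epochs=150, batch_size=16)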
Upvotes: 1
Reputation: 213
Your task is related to object detection. The difference is that you seem to have exactly one object in each image, whereas in detection there may be multiple objects or none at all. For object detection there are networks such as YOLOv3 (https://pjreddie.com/media/files/papers/YOLOv3.pdf) or the Single Shot MultiBox Detector, SSD (https://arxiv.org/pdf/1512.02325.pdf), but a ResNet can also be trained as an object detection network (as in this paper: https://arxiv.org/pdf/1506.01497.pdf).
I will briefly describe how YOLO solves the regression problem for the bounding-box x, y coordinates: the network does not predict the coordinates directly but raw values t_x, t_y, which are passed through a sigmoid so that the predicted center always lands inside the responsible grid cell: b_x = sigma(t_x) + c_x and b_y = sigma(t_y) + c_y, where c_x, c_y are the cell offsets. So a sigmoid output for normalized coordinates is an established choice.
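A small sketch of that decoding step (illustrative only; names and the grid normalization are paraphrased from the YOLOv3 paper, not your model):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_center(t_x, t_y, c_x, c_y, grid_size):
    # The sigmoid confines the offset to [0, 1] within cell (c_x, c_y);
    # dividing by the grid size yields image-normalized coordinates.
    b_x = (sigmoid(t_x) + c_x) / grid_size
    b_y = (sigmoid(t_y) + c_y) / grid_size
    return b_x, b_y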
In principle your setup looks fine to me. But there are many things that could cause poor performance, since you don't tell us much about your dataset: Are you using a pretrained network or training from scratch? Is the object a new category, or one the network has seen before? etc.
Here are some ideas you could try: start from a pretrained backbone and fine-tune it rather than training from scratch, augment your small dataset (e.g. flips and translations, adjusting the label coordinates accordingly, as sketched below), and lower the learning rate or reduce the model capacity so you don't overfit 120 images.
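For example, a hypothetical horizontal-flip augmentation for this midpoint task (mirroring the image turns the normalized x coordinate into 1 - x):

import numpy as np

def flip_horizontal(image, midpoint):
    # Mirror the image left-right and correct the normalized label.
    x, y = midpoint
    return np.fliplr(image), (1.0 - x, y)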
I hope you find some inspiration for your solution.
Upvotes: 1