Reputation: 1
I am running image segmentation code in PyTorch, based on the LinkNet architecture. The optimizer is initially set as:
self.optimizer = torch.optim.Adam(params=self.net.parameters(), lr=lr)
Then I changed it to SGD with Nesterov momentum, hoping to improve performance:
self.optimizer = torch.optim.SGD(params=self.net.parameters(), lr=lr, momentum=0.9, nesterov=True)
However, the performance is worse with Nesterov: with Adam the loss converges to 0.19, but with Nesterov it only reaches 0.34.
By the way, the learning rate is divided by 5 if the loss does not decrease for 3 consecutive epochs, and the lr can be adjusted at most 3 times. After that, training stops.
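For reference, here is a minimal pure-Python sketch of the plateau schedule described above, assuming it works as stated (divide the lr by 5 after 3 consecutive epochs without improvement, at most 3 reductions, then stop); the class and attribute names are placeholders, not from the actual code:

```python
class PlateauSchedule:
    """Divide lr by `factor` after `patience` stagnant epochs, at most `max_reductions` times."""

    def __init__(self, lr, factor=0.2, patience=3, max_reductions=3):
        self.lr = lr
        self.factor = factor            # 0.2 == dividing the lr by 5
        self.patience = patience
        self.max_reductions = max_reductions
        self.best = float("inf")
        self.bad_epochs = 0
        self.reductions = 0
        self.done = False               # set when training should stop

    def step(self, loss):
        if loss < self.best:
            self.best = loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                if self.reductions < self.max_reductions:
                    self.lr *= self.factor
                    self.reductions += 1
                    self.bad_epochs = 0
                else:
                    self.done = True    # 3 reductions used up: end training
        return self.lr
```

PyTorch's built-in `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.2, patience=3)` implements essentially the same plateau logic, minus the "stop after 3 reductions" rule.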
I am wondering why this happens and what I should do to improve it. Thanks a lot for the replies :)
Upvotes: 0
Views: 1081
Reputation: 777
Your question seems to rely on the assumption that SGD with Nesterov momentum would definitely perform better than Adam. However, no optimizer is universally better than another; you always have to evaluate it for your particular model (layers, activation functions, loss, etc.) and dataset.
Are you increasing the number of epochs for SGD? Usually, SGD takes much longer to converge than Adam. Note that recent studies show that despite training faster, Adam generalizes worse to the validation and test datasets (https://arxiv.org/abs/1712.07628). An alternative to that is to start the optimization with Adam, and then after some epochs, change the optimizer to SGD.
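The Adam-then-SGD switch suggested above could look roughly like this sketch; `model`, `switch_epoch`, and the hyperparameter values are illustrative placeholders, not taken from your setup:

```python
import torch

model = torch.nn.Linear(10, 2)          # stand-in for your LinkNet
lr, switch_epoch, num_epochs = 1e-3, 10, 20

# Phase 1: start with Adam for fast early convergence.
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(num_epochs):
    if epoch == switch_epoch:
        # Phase 2: rebuild the optimizer over the same parameters.
        # SGD starts with fresh (zero) momentum buffers at this point.
        optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                    momentum=0.9, nesterov=True)
    # ... training step goes here:
    # optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Note that the switch discards Adam's per-parameter state, so the loss often bumps up briefly before SGD settles in; lowering the lr at the switch point is a common adjustment.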
Upvotes: 2