Reputation: 725
The Adam optimizer has flaws when used with weight decay, and in 2018 the AdamW optimizer was proposed to address them.
Is there a standard way to implement AdamW in the MXNet framework (Python implementation)? There is an mxnet.optimizer.Adam class, but no mxnet.optimizer.AdamW one (checked in the mxnet-cu102==1.6.0 and mxnet==1.5.0 package versions).
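For example, a quick check along these lines (a minimal snippet; it just inspects mxnet.optimizer with standard Python introspection) shows an Adam class but no AdamW:

```python
import mxnet as mx

# Inspect which optimizer classes this MXNet build exposes.
print(mx.__version__)                  # 1.5.0 / 1.6.0 in the versions above
print(hasattr(mx.optimizer, 'Adam'))   # True
print(hasattr(mx.optimizer, 'AdamW'))  # False -- no AdamW class here
```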
P.S. I asked this question on the MXNet forum and on datascience.stackexchange.com, but to no avail.
Upvotes: 1
Views: 357
Reputation: 20287
Short answer: There isn't a standard way to use AdamW in Gluon yet, but there is some existing work in that direction that would make it relatively easy to add.
Longer answer:
People have been asking for this feature - a lot :) See: https://github.com/apache/incubator-mxnet/issues/9182
Gluon-NLP has a working version of AdamW - possibly slightly different from the one in the original paper: https://github.com/eric-haibin-lin/gluon-nlp/blob/df63e2c2a4d6b998289c25a38ffec8f4ff647ff4/src/gluonnlp/optimizer/bert_adam.py
The adamw_update() operator was added with this pull request: https://github.com/apache/incubator-mxnet/pull/13728 and was first released in MXNet 1.6.0.
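For example, the operator can be applied to a weight, its gradient, and the Adam state arrays directly. This is only a sketch: the keyword names (lr, eta, wd, beta1, beta2, epsilon) and the NDArray rescale_grad input are assumptions based on the pull request above, so double-check them against the operator documentation in your MXNet version.

```python
import mxnet as mx

# One hand-rolled AdamW step with the contrib operator (MXNet >= 1.6.0).
weight = mx.nd.random.normal(shape=(5,))
grad = mx.nd.random.normal(shape=(5,))
mean = mx.nd.zeros_like(weight)    # first-moment state (m)
var = mx.nd.zeros_like(weight)     # second-moment state (v)
rescale_grad = mx.nd.ones((1,))    # assumed to be an NDArray input for this op

mx.nd.contrib.adamw_update(
    weight, grad, mean, var, rescale_grad,
    lr=1e-3,                 # Adam step size
    eta=1.0,                 # schedule multiplier from the AdamW paper
    beta1=0.9, beta2=0.999, epsilon=1e-8,
    wd=0.01,                 # decoupled weight decay
    out=weight,              # write the updated weight in place
)
```

A real optimizer would typically wrap this in per-parameter state management and fold any bias correction into lr or eta.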
Unfortunately, it looks like there isn't a way to use this with gluon.Trainer directly right now without copying/modifying the BERTAdam code (or writing something similar from scratch). That would be a very nice thing to add to Gluon.
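As a rough starting point for the "writing something similar from scratch" route, here is a minimal sketch of a custom optimizer with decoupled weight decay, built only on the public mx.optimizer.Optimizer API and plain NDArray ops. The class name, hyperparameter defaults, and update code are my own illustration (not the BERTAdam implementation, and not an official MXNet API):

```python
import math
import mxnet as mx

@mx.optimizer.Optimizer.register
class AdamW(mx.optimizer.Optimizer):
    """Adam with decoupled weight decay (illustrative sketch only)."""

    def __init__(self, learning_rate=1e-3, beta1=0.9, beta2=0.999,
                 epsilon=1e-8, **kwargs):
        super(AdamW, self).__init__(learning_rate=learning_rate, **kwargs)
        self.beta1 = beta1
        self.beta2 = beta2
        self.epsilon = epsilon

    def create_state(self, index, weight):
        # First and second moment estimates (m, v) per parameter array.
        return (mx.nd.zeros(weight.shape, weight.context, dtype=weight.dtype),
                mx.nd.zeros(weight.shape, weight.context, dtype=weight.dtype))

    def update(self, index, weight, grad, state):
        self._update_count(index)
        lr = self._get_lr(index)
        wd = self._get_wd(index)
        t = self._index_update_count[index]

        grad = grad * self.rescale_grad
        if self.clip_gradient is not None:
            grad = mx.nd.clip(grad, -self.clip_gradient, self.clip_gradient)

        mean, var = state
        mean[:] = self.beta1 * mean + (1.0 - self.beta1) * grad
        var[:] = self.beta2 * var + (1.0 - self.beta2) * grad * grad

        # Bias-corrected step size, as in Adam.
        step = lr * math.sqrt(1.0 - self.beta2 ** t) / (1.0 - self.beta1 ** t)

        # Decoupled weight decay: wd acts on the weight directly instead of
        # being folded into the gradient as in plain Adam + L2 regularization.
        weight[:] -= step * mean / (mx.nd.sqrt(var) + self.epsilon) + lr * wd * weight
```

Because it is registered, it could then be passed to gluon.Trainer by name, e.g. mx.gluon.Trainer(net.collect_params(), 'adamw', {'learning_rate': 1e-3, 'wd': 0.01}), again only as a sketch of what an AdamW in Gluon could look like.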
Please let me know if you get this working, as I'd love to be able to use that as well.
Upvotes: 1