Reputation: 331
To simplify the problem, say when a dimension (or a feature) is already updated n times, the next time I see the feature, I want to set the learning rate to be 1/n.
I came up with these codes:
def test_adagrad():
embedding = theano.shared(value=np.random.randn(20,10), borrow=True)
times = theano.shared(value=np.ones((20,1)))
lr = T.dscalar()
index_a = T.lvector()
hist = times[index_a]
cost = T.sum(theano.sparse_grad(embedding[index_a]))
gradients = T.grad(cost, embedding)
updates = [(embedding, embedding+lr*(1.0/hist)*gradients)]
### Here should be some codes to update also times which are omitted ###
train = theano.function(inputs=[index_a, lr],outputs=cost,updates=updates)
for i in range(10):
print train([1,2,3],0.05)
Theano does not give any error, but the training result give Nan sometimes. Does anybody know how to correct this please ?
Thank you for your help
PS: I doubt it is the operations in sparse space which creates problems. So I tried to replace * by theano.sparse.mul. This gave the some results as I mentioned before
Upvotes: 2
Views: 3462
Reputation: 872
I find this implementation from Lasagne very concise and readable. You can use it pretty much as it is:
for param, grad in zip(params, grads):
value = param.get_value(borrow=True)
accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
broadcastable=param.broadcastable)
accu_new = accu + grad ** 2
updates[accu] = accu_new
updates[param] = param - (learning_rate * grad /
T.sqrt(accu_new + epsilon))
Upvotes: 1
Reputation: 381
I was looking for the same thing and ended up implementing it myself in the style of the resource zuuz already pointed out. So maybe this helps anyone looking for help here.
def adagrad(lr, tparams, grads, inp, cost):
# stores the current grads
gshared = [theano.shared(np.zeros_like(p.get_value(),
dtype=theano.config.floatX),
name='%s_grad' % k)
for k, p in tparams.iteritems()]
grads_updates = zip(gshared, grads)
# stores the sum of all grads squared
hist_gshared = [theano.shared(np.zeros_like(p.get_value(),
dtype=theano.config.floatX),
name='%s_grad' % k)
for k, p in tparams.iteritems()]
rgrads_updates = [(rg, rg + T.sqr(g)) for rg, g in zip(hist_gshared, grads)]
# calculate cost and store grads
f_grad_shared = theano.function(inp, cost,
updates=grads_updates + rgrads_updates,
on_unused_input='ignore')
# apply actual update with the initial learning rate lr
n = 1e-6
updates = [(p, p - (lr/(T.sqrt(rg) + n))*g)
for p, g, rg in zip(tparams.values(), gshared, hist_gshared)]
f_update = theano.function([lr], [], updates=updates, on_unused_input='ignore')
return f_grad_shared, f_update
Upvotes: 1
Reputation: 879
Perhaps you can utilize the following example for implementation of adadelta, and use it to derive your own. Please update if you succeeded :-)
Upvotes: 7