jmf_zaiecp

Reputation: 331

How to code AdaGrad in Python with Theano

To simplify the problem: when a dimension (or feature) has already been updated n times, the next time I see that feature I want to set its learning rate to 1/n.

I came up with this code:

import numpy as np
import theano
import theano.tensor as T

def test_adagrad():
    embedding = theano.shared(value=np.random.randn(20, 10), borrow=True)
    times = theano.shared(value=np.ones((20, 1)))
    lr = T.dscalar()
    index_a = T.lvector()
    hist = times[index_a]
    cost = T.sum(theano.sparse_grad(embedding[index_a]))
    gradients = T.grad(cost, embedding)
    updates = [(embedding, embedding + lr * (1.0 / hist) * gradients)]
    ### The code that also updates `times` is omitted here (see the sketch below) ###
    train = theano.function(inputs=[index_a, lr], outputs=cost, updates=updates)
    for i in range(10):
        print train([1, 2, 3], 0.05)
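
For reference, the omitted `times` update inside test_adagrad could look something like this (just a sketch using T.inc_subtensor; I have not verified that it fixes the NaN):

times_update = (times, T.inc_subtensor(times[index_a], 1.0))
updates = [(embedding, embedding + lr * (1.0 / hist) * gradients),
           times_update]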

Theano does not give any error, but the training result sometimes contains NaN. Does anybody know how to correct this, please?

Thank you for your help

PS: I suspect it is the operations in sparse space that create the problem, so I tried replacing * with theano.sparse.mul. This gave the same results as mentioned before.

Upvotes: 2

Views: 3462

Answers (3)

minhle_r7

Reputation: 872

I find this implementation from Lasagne very concise and readable. You can use it pretty much as it is:

# params, grads, learning_rate and epsilon come from the surrounding code;
# updates is an OrderedDict mapping shared variables to their new values.
for param, grad in zip(params, grads):
    value = param.get_value(borrow=True)
    # accumulated sum of squared gradients, one accumulator per parameter
    accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                         broadcastable=param.broadcastable)
    accu_new = accu + grad ** 2
    updates[accu] = accu_new
    updates[param] = param - (learning_rate * grad /
                              T.sqrt(accu_new + epsilon))
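
If you want something self-contained, one way to wrap that loop into a full function might be the following (my own sketch, not Lasagne's code; the name adagrad_updates and the default constants are mine, and updates is a collections.OrderedDict as Lasagne uses):

from collections import OrderedDict

import numpy as np
import theano
import theano.tensor as T

def adagrad_updates(params, grads, learning_rate=0.01, epsilon=1e-6):
    updates = OrderedDict()
    for param, grad in zip(params, grads):
        value = param.get_value(borrow=True)
        # accumulated sum of squared gradients, one accumulator per parameter
        accu = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                             broadcastable=param.broadcastable)
        accu_new = accu + grad ** 2
        updates[accu] = accu_new
        updates[param] = param - (learning_rate * grad /
                                  T.sqrt(accu_new + epsilon))
    return updates

You would then pass the returned dict to theano.function, e.g. theano.function([x, y], cost, updates=adagrad_updates(params, T.grad(cost, params))), where x, y, cost and params come from your own model.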

Upvotes: 1

vkoe

Reputation: 381

I was looking for the same thing and ended up implementing it myself, in the style of the resource zuuz already pointed to. Maybe this helps anyone else looking for an example here.

import numpy as np
import theano
import theano.tensor as T

def adagrad(lr, tparams, grads, inp, cost):
    # stores the current grads
    gshared = [theano.shared(np.zeros_like(p.get_value(),
                                           dtype=theano.config.floatX),
                             name='%s_grad' % k)
               for k, p in tparams.iteritems()]
    grads_updates = zip(gshared, grads)
    # stores the sum of all grads squared
    hist_gshared = [theano.shared(np.zeros_like(p.get_value(),
                                                dtype=theano.config.floatX),
                                  name='%s_hist_grad' % k)
                    for k, p in tparams.iteritems()]
    rgrads_updates = [(rg, rg + T.sqr(g)) for rg, g in zip(hist_gshared, grads)]

    # calculate cost and store grads
    f_grad_shared = theano.function(inp, cost,
                                    updates=grads_updates + rgrads_updates,
                                    on_unused_input='ignore')

    # apply the actual update with the initial learning rate lr
    n = 1e-6  # small constant for numerical stability
    updates = [(p, p - (lr / (T.sqrt(rg) + n)) * g)
               for p, g, rg in zip(tparams.values(), gshared, hist_gshared)]

    f_update = theano.function([lr], [], updates=updates, on_unused_input='ignore')

    return f_grad_shared, f_update
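
For completeness, this is roughly how the two returned functions would be used in a training loop (a sketch; x, y, cost, tparams, minibatches and n_epochs stand in for your own model and data):

lr = T.scalar(name='lr')
grads = T.grad(cost, wrt=list(tparams.values()))
f_grad_shared, f_update = adagrad(lr, tparams, grads, [x, y], cost)

for epoch in range(n_epochs):
    for x_batch, y_batch in minibatches:
        batch_cost = f_grad_shared(x_batch, y_batch)  # compute cost, store grads
        f_update(0.1)                                 # take one adagrad step with base lr 0.1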

Upvotes: 1

zuuz

Reputation: 879

Perhaps you can use the following example of an adadelta implementation and derive your own adagrad from it. Please post an update if you succeed :-)
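
In outline, a Theano adadelta implementation of that kind looks roughly like this (a condensed sketch from memory, not the exact code at the link; rho and eps are the usual adadelta constants):

import numpy as np
import theano
import theano.tensor as T

def adadelta_updates(params, grads, rho=0.95, eps=1e-6):
    updates = []
    for p, g in zip(params, grads):
        value = p.get_value(borrow=True)
        # running average of squared gradients
        accu_g = theano.shared(np.zeros_like(value), broadcastable=p.broadcastable)
        # running average of squared parameter steps
        accu_dx = theano.shared(np.zeros_like(value), broadcastable=p.broadcastable)
        accu_g_new = rho * accu_g + (1. - rho) * g ** 2
        step = g * T.sqrt(accu_dx + eps) / T.sqrt(accu_g_new + eps)
        accu_dx_new = rho * accu_dx + (1. - rho) * step ** 2
        updates += [(accu_g, accu_g_new), (accu_dx, accu_dx_new), (p, p - step)]
    return updates

For plain adagrad you would keep only a single accumulator of squared gradients and divide the learning rate by its square root, as the other answers show.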

Upvotes: 7
