Reputation: 369
I have been experimenting with TensorFlow (TF) lately and I came across this problem: say I want to compute the value and the gradient of the function

$$ f(x) = \sum_{ijk} J_{ijk}\, x_i x_j x_k, $$

where the x's are indexed differently but all refer to the same vector and the J's are random constants (in physics this is a spin glass model). The gradient with respect to x_k is then simply

$$ \frac{\partial f}{\partial x_k} = \sum_{ij} J_{ijk}\, x_i x_j, $$

hence f sums over N^3 terms and gradf sums N times over N^2 terms. I have implemented f
by generating all the terms of the sum as a rank 3 tensor and sum-reducing over all the entries. Then to differentiate I apply
tf.gradients(f, xk)[0]
where f is the loss function and xk a variable. Here's a MWE where I assume all the J's to be 1:
import numpy as np
import tensorflow as tf
#first I define the variable
n=10 #size of x
x1 = tf.Variable(tf.zeros([n], dtype='float64'))
x2 = tf.placeholder(tf.float64, shape=[n])
#here I define the cost function
f_tensor = tf.mul(tf.mul(tf.reshape(x1, [n]),
                         tf.reshape(x2, [n,1])),
                  tf.reshape(x2, [n,1,1]))
f = tf.reduce_sum(f_tensor)
session = tf.Session()
init = tf.initialize_all_variables()
session.run(init)
#run on test array
xtest = np.ones(n)
res = session.run([f, tf.gradients(f, x1)[0]],
                  feed_dict={x1 : xtest,
                             x2 : xtest})
assert res[0] == 1000
assert all(res[1] == np.array([100 for _ in xrange(n)]))
I need to call run many times independently, and I want to reduce the number of variable assignments to just one, since x1 and x2 refer to the same vector.
Some profiling on a related example for n=200 (on a GeForce GTX 650) showed that assignment is the most expensive operation when performing the computation on GPUs (results are similar for this MWE). Obviously the overhead gets worse with increasing n, partially neutralising the benefit of using GPUs.
Any suggestion on how I could reduce this overhead by transferring x only once? Also, any suggestion on how to reduce other overheads would be immensely appreciated.
To show the problem in action, I'll follow the suggestion by mrry. If I were to replace all instances of x2 with x1, the MWE would look like this:
#first I define the variable
n=10 #size of x
x1 = tf.Variable(tf.zeros([n], dtype='float64'))
#here I define the cost function
f_tensor = tf.mul(tf.mul(tf.reshape(x1, [n]),
                         tf.reshape(x1, [n,1])),
                  tf.reshape(x1, [n,1,1]))
f = tf.reduce_sum(f_tensor)
session = tf.Session()
init = tf.initialize_all_variables()
session.run(init)
#run on test array
xtest = np.ones(n)
session.run(x1.assign(xtest))
res = session.run([f, tf.gradients(f, x1)[0]])
assert res[0] == 1000
for g in res[1]:
    assert g == 100
and the second assertion would fail: each entry of the gradient would be 300 instead of the expected 100. The reason is that while x_i, x_j, x_k all refer to the same vector, they are symbolically distinct; replacing them all with the same variable amounts to differentiating x^3, which gives 3*x^2, hence the result of the second MWE.
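Concretely, with all J's equal to 1 and x = (1, ..., 1), identifying the three factors turns f into a perfect cube, so

$$ f = \Big(\sum_i x_i\Big)^3 = n^3 = 1000, \qquad \frac{\partial f}{\partial x_k} = 3\Big(\sum_i x_i\Big)^2 = 3n^2 = 300. $$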
P.S. I have also explicitly assigned x1 for clarity
Upvotes: 3
Views: 664
Reputation: 53
I couldn't comment above (not enough reputation), but note that the analytical gradient should be
$$ \frac{\partial f}{\partial x_k} = \sum_{ij} J_{ijk} x_i x_j + \sum_{ij} J_{ikj} x_i x_j + \sum_{ij} J_{kij} x_i x_j. $$
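A quick NumPy check of this formula, with all J's and x's set to 1 as in the MWE, gives 300 per component, which matches the result reported in the second MWE of the question:

import numpy as np

n = 10
J = np.ones((n, n, n))
x = np.ones(n)

# f = sum_{ijk} J_ijk x_i x_j x_k
f = np.einsum('ijk,i,j,k->', J, x, x, x)

# gradient: one term for each occurrence of x in the product
grad = (np.einsum('ijk,i,j->k', J, x, x) +
        np.einsum('ikj,i,j->k', J, x, x) +
        np.einsum('kij,i,j->k', J, x, x))

assert f == 1000
assert all(grad == 300)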
Upvotes: 1
Reputation: 126154
One way to achieve your desired outcome is to use the tf.stop_gradient()
op to make an efficient copy of the variable x1
without it contributing to the gradient:
import numpy as np
import tensorflow as tf
# First define the variable.
n = 10 # size of x
x1 = tf.Variable(tf.zeros([n], dtype=tf.float64))
x2 = tf.stop_gradient(x1)
# Now define the cost function
f_tensor = tf.mul(tf.mul(tf.reshape(x1, [n]),
                         tf.reshape(x2, [n,1])),
                  tf.reshape(x2, [n,1,1]))
f = tf.reduce_sum(f_tensor)
session = tf.Session()
init = tf.initialize_all_variables()
session.run(init)
# Run on test array
xtest = np.ones(n)
res = session.run([f, tf.gradients(f, x1)[0]],
                  feed_dict={x1 : xtest})
assert res[0] == 1000
for g in res[1]:
    assert g == 100
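The same trick carries over to later TensorFlow versions. Here is a minimal sketch of the equivalent computation with eager execution and tf.GradientTape (assuming TensorFlow 2.x); tf.stop_gradient again keeps the two extra factors out of the gradient:

import numpy as np
import tensorflow as tf

n = 10
x1 = tf.Variable(np.ones(n), dtype=tf.float64)

with tf.GradientTape() as tape:
    x2 = tf.stop_gradient(x1)  # same values as x1, but excluded from the gradient
    f_tensor = (tf.reshape(x1, [n]) *
                tf.reshape(x2, [n, 1]) *
                tf.reshape(x2, [n, 1, 1]))
    f = tf.reduce_sum(f_tensor)

grad = tape.gradient(f, x1)
assert float(f) == 1000
assert np.allclose(grad.numpy(), 100)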
Upvotes: 2