Rich

Reputation: 93

Do input & output tensors to TensorFlow's OpKernel::Compute() function change their address across multiple function calls?

I am working on supporting TensorFlow on a new architecture.

Consider the following TensorFlow code:

import tensorflow as tf
import random as r

def random_10x10():
  return [[r.normalvariate(1.0,1.0) for i in range(10)] for j in range(10)]

a = tf.placeholder(tf.float32, shape=[10, 10])
b = tf.placeholder(tf.float32, shape=[10, 10]) 
c = tf.placeholder(tf.float32, shape=[10, 10]) 
d = tf.placeholder(tf.float32, shape=[10, 10]) 

with tf.device('/device:CPU:0'):
  mm1 = tf.matmul(a,b)
  mm2 = tf.matmul(c,d)
  output = tf.add(mm1,mm2)

sess = tf.Session() 
for i in range(10):
  print(sess.run(output,
      feed_dict={ a:random_10x10(), b:random_10x10(),
                  c:random_10x10(), d:random_10x10()} ))

The execution graph of this program contains two matmul nodes, and during execution each is represented by its own instantiation of the MatMulOp subclass of OpKernel (tensorflow/core/kernels/matmul_op.cc). The first thing MatMulOp::Compute() does is grab the addresses of the input tensors:

  void Compute(OpKernelContext* ctx) override {
    const Tensor& a = ctx->input(0);
    const Tensor& b = ctx->input(1);
...

My understanding of TensorFlow is that during each iteration of sess.run() above, the two instantiations of MatMulOp do not change. For each MatMul block, can I expect the addresses of the inputs to remain constant over the iterations, or could it be that in the seventh iteration of the sess.run() call, ctx->input(0) will have a different value than it did in the sixth?

The Compute() method also invokes ctx->allocate_output(), which eventually wraps our own architecture's allocator. Is it OK to allocate the output block once and then just keep using the same block over future runs in the same session?
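
For reference, the output allocation in question looks roughly like this (a simplified sketch, not the exact matmul_op.cc code):

  TensorShape out_shape({a.dim_size(0), b.dim_size(1)});
  Tensor* out = nullptr;
  // allocate_output() asks the device's Allocator (ours, in this port)
  // for a suitably sized and aligned buffer and registers it as output 0.
  OP_REQUIRES_OK(ctx, ctx->allocate_output(0, out_shape, &out));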

Upvotes: 2

Views: 927

Answers (1)

vrv

Reputation: 411

Thanks for the question!

TL;DR: yes, input and output tensors will change their address across different invocations of the Run() call.

Details:

If there are two different MatMul ops in the graph, there will be two instantiations of the MatMulOp OpKernel. They are held in an OpSegment cache: the kernel for a graph op is looked up or created on first use and then reused for every future call to Session::Run().

The addresses of the inputs are not guaranteed to remain constant over iterations. In fact, it is almost certain that they will change, in part because execution of the dataflow graph is dynamic rather than deterministic.

The memory behind ctx->input(0) was most likely allocated by an allocate_output() call inside the upstream op that produced it, so the answer is the same for both the input and the output memory locations. The allocate_output() function eventually delegates to the device's Allocator implementation to allocate memory of the appropriate size (and alignment). On CPU, the current implementation delegates to malloc(), so just like malloc(), you may get different memory on every call to Run(). On GPU, we use a custom GPU allocator (the BFCAllocator class) to dynamically allocate memory, and its properties are similar to malloc() in that different memory may be returned depending on the order in which memory is allocated and released through the Allocator.

So in general, the memory returned by the call to allocate_output() is handled by the Allocator implementations for a device. The CPU and GPU implementations do not provide a stable pointer guarantee across Runs of the same graph.
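
If you want to verify this empirically on your device, one option (a sketch, not something the stock kernels do) is to log the input buffer's address from inside Compute() and compare it across Run() calls:

  void Compute(OpKernelContext* ctx) override {
    const Tensor& a = ctx->input(0);
    // tensor_data() exposes the raw backing buffer; print its address.
    LOG(INFO) << "input 0 buffer at "
              << static_cast<const void*>(a.tensor_data().data());
    // ... rest of the kernel unchanged
  }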

However, if you are implementing a custom device, you likely have to implement a custom allocator for your device, and it may be possible for you to write the allocator in such a way that it returns the same memory for the same output from a graph. But that would require figuring out how to pass the identifier of the op to the Allocator so that you could return the same memory each time.
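
As a very rough sketch of that idea (the key-plumbing here is hypothetical and not an existing TensorFlow API; the interface you implement is tensorflow::Allocator from tensorflow/core/framework/allocator.h):

#include <string>
#include <unordered_map>

#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/platform/mem.h"

// Hypothetical allocator that hands back the same buffer whenever the same
// key is requested again. How the key gets set (e.g. derived from the op
// name) is the part you would have to design yourself, and this assumes the
// requested size for a given key never changes (i.e. a fixed-shape graph).
class StableAllocator : public tensorflow::Allocator {
 public:
  std::string Name() override { return "stable_allocator"; }

  void* AllocateRaw(size_t alignment, size_t num_bytes) override {
    auto it = cache_.find(current_key_);
    if (it != cache_.end()) return it->second;  // reuse the earlier buffer
    void* ptr = tensorflow::port::AlignedMalloc(num_bytes, alignment);
    cache_[current_key_] = ptr;
    return ptr;
  }

  void DeallocateRaw(void* ptr) override {
    // Deliberately keep buffers alive for reuse; a real implementation
    // needs a proper lifetime / ownership story here.
  }

  // Hypothetical hook: whoever drives allocation sets the key first.
  void set_current_key(const std::string& key) { current_key_ = key; }

 private:
  std::string current_key_;
  std::unordered_map<std::string, void*> cache_;
};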

TensorFlow intentionally does dynamic memory allocation for at least a few reasons:

1) The execution order of a dataflow graph may depend on external inputs, so committing to a fixed schedule up front can lead to unnecessary stalls. Dynamic execution order ensures that ops execute only once all of their inputs are ready.

2) Shapes can be dynamic in TensorFlow (e.g., your graph can handle variable batch sizes), which means that an op may need to allocate a different amount of memory for the same output from Run() to Run() (see the sketch below). This is a big reason why we cannot and do not provide such guarantees.
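
To make the second point concrete, here is a sketch (not real kernel code) of why the allocator may be asked for a different number of bytes on every Run() when a dimension is only known from that Run()'s feed:

  void Compute(OpKernelContext* ctx) override {
    const Tensor& in = ctx->input(0);
    // The batch dimension is only known now, from this Run()'s feed.
    TensorShape out_shape({in.dim_size(0), 10});
    Tensor* out = nullptr;
    // A different num_bytes can reach the device Allocator on every Run().
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, out_shape, &out));
    // ... fill in out
  }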

We understand there are cases where a device wants to optimize for one instantiation of a graph (with fixed sizes), so the device can pre-plan the entire dataflow graph and amortize that planning cost over many executions. These cases are generally better served by the XLA compiler framework (https://www.tensorflow.org/versions/master/resources/xla_prerelease), but it may be possible to get this to work for limited situations / graphs without XLA using the existing device framework.

Upvotes: 3
