Miguel Monteiro

Reputation: 379

TensorFlow new op in Google Cloud ML Engine

I have created a new TensorFlow op, compiled it, and tested it locally; it works.

I now want to use it with Google Cloud ML Engine, which requires compiling the op on each of the machines in the cloud.

I have managed to include the C++ files and the compiler script in the distribution using setup.py:

from setuptools import setup, find_packages

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    package_data={'CPP': ['*.cc', '*.h', 'compile.sh']},
    description='Package description'
)

Now I have to run compile.sh which contains:

TF_INC=$(python -c 'import tensorflow as tf; print(tf.sysconfig.get_include())')

g++ -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0 -shared target.cc dependency_1.cc -o target.so -fPIC -I$TF_INC -I$TF_INC/external/nsync/public -O2

The script can be run from Python code using subprocess.Popen(), so that is not the issue.
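
For illustration, here is a minimal sketch of that step (not from the original post): it assumes the CPP package from package_data above is importable and that compile.sh is shipped next to the sources.

import os
import subprocess
import CPP  # hypothetical: the package that ships the sources, per package_data above

# Run the shipped compile script from the directory it was installed into.
cpp_dir = os.path.dirname(CPP.__file__)
proc = subprocess.Popen(['bash', 'compile.sh'], cwd=cpp_dir)
if proc.wait() != 0:
    raise RuntimeError('compile.sh failed')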

The issue is that I don't know how the directory tree is structured within Google Cloud, so I don't know where to run this script from, or how to access its output later on to make use of the new op.

Upvotes: 2

Views: 141

Answers (2)

Miguel Monteiro

Reputation: 379

Here is my solution. I used the Extension class from setuptools to build the op dynamically when the package is deployed in the cloud:

import tensorflow as tf
from setuptools import setup, find_packages, Extension


# Paths to the headers and shared library shipped with the TensorFlow pip package
TF_INC = tf.sysconfig.get_include()
TF_LIB = tf.sysconfig.get_lib()

module1 = Extension('op_name',
                    # match the ABI the prebuilt TensorFlow binaries use
                    define_macros=[('_GLIBCXX_USE_CXX11_ABI', 0)],
                    include_dirs=[TF_INC, TF_INC + '/external/nsync/public', 'Op/cpp'],
                    sources=['Op/cpp/op_name.cc',
                             'Op/cpp/source_2.cc'],
                    extra_compile_args=['-O2', '-std=c++11', '-fPIC', '-shared'],
                    library_dirs=[TF_LIB],
                    libraries=['tensorflow_framework'],
                    extra_link_args=[],
                    language='c++')

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    package_data={'Op': ['cpp/*.cc', 'cpp/*.h']},
    ext_modules=[module1],
)

Some notes:

  • If you have header files, you have to include the directory for those files in include_dirs. In this case, I have the header files in the same directory as the source files (Op/cpp).
  • However, this does not mean the .h files are packaged. For that you must use package_data={'Op': ['cpp/*.cc', 'cpp/*.h']} so that the .h files are included in the manifest. The .cc files should be included anyway since they are sources; I just have them here for completeness.
  • The compiler used in the cloud is gcc, which invokes cc1plus. The official TensorFlow documentation uses g++. I don't know what implications this has in terms of performance...
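
For completeness, here is a sketch (not part of the original answer) of how the shared object built by this setup.py could be located and loaded at run time; the exact file name setuptools produces depends on the platform and Python version.

import glob
import os
import sys
import tensorflow as tf

# Search the import path for the extension built from 'op_name' and load the
# first match as a TensorFlow op library.
candidates = []
for path in sys.path:
    candidates.extend(glob.glob(os.path.join(path, 'op_name*.so')))

op_module = tf.load_op_library(candidates[0])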

Upvotes: 0

Guoqing Xu

Reputation: 482

The easiest way is to build the op locally, upload it to your GCS bucket, copy it to the VM (container), and use it via tf.load_op_library. You can use the preinstalled gsutil CLI or the GCS Python client to perform the copy.
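
A rough sketch of that workflow from inside the training job (bucket path and file name are placeholders):

import subprocess
import tensorflow as tf

# Copy the prebuilt op from GCS using the preinstalled gsutil CLI, then load it.
subprocess.check_call(
    ['gsutil', 'cp', 'gs://YOUR_BUCKET/ops/target.so', '/tmp/target.so'])
op_lib = tf.load_op_library('/tmp/target.so')
# Ops registered in target.so are then available as op_lib.<registered_op_name>(...)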

Back to the original question: when we start a user job, we first install the user code package as root, so in the case of Python 2.7 it's located at /root/.local/lib/python2.7/site-packages/YOUR_PACKAGE_NAME
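
So, assuming the compiled library is shipped inside the package (the file name below is a placeholder), it can be loaded directly from that location:

import tensorflow as tf

op_lib = tf.load_op_library(
    '/root/.local/lib/python2.7/site-packages/YOUR_PACKAGE_NAME/target.so')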

Upvotes: 1
