Yura Taras

Reputation: 1293

How to embed files in a Dataflow job using the Python API

I'm writing a batch Beam job in Python and running it on Google Dataflow. I would like to move part of my Python code's configuration into a .json file and embed it in the Python package - the same way I would do it in Java.

I've created a MANIFEST.in file:

include *.json

and also added a data_files entry to setup.py:

data_files=[
    ('.', ['config.json'])
],

When I run both setup.py sdist and setup.py bdist, I can confirm that the file gets included in the package.

I also have code that loads the JSON file:

import json
from pathlib import Path

CONFIG_PATH = Path(__file__).parent / 'config.json'
with CONFIG_PATH.open() as fp:
    config = json.load(fp)

When I run the module using the DirectRunner, the file gets loaded. However, when I submit the job to Dataflow, it fails because it can't find config.json. I added debug logging that traverses the file system, and I can see that the file isn't present in /usr/local/lib/python2.7/dist-packages/ on the worker nodes, where all the required libraries are installed.

I've looked through the Beam documentation, including https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/, and I can't find a recommended way to package non-Python files with a Beam job so they are available on the worker nodes.
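For reference, a minimal sketch of how such a job is typically submitted with a custom setup.py, as that documentation page describes (the project, bucket, and pipeline body below are placeholders, not my actual settings):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values - replace project, temp_location and the setup.py path with your own.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    temp_location='gs://my-bucket/tmp',
    setup_file='./setup.py',  # Dataflow builds and installs this package on the workers
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(['dummy'])
     | beam.Map(lambda x: x))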

Upvotes: 0

Views: 910

Answers (1)

Yura Taras

Reputation: 1293

OK, the catch is to add one more line to setup.py:

include_package_data=True,

I've verified it works.
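For completeness, a minimal sketch of the resulting setup.py (the package name, version and dependency list are placeholders):

from setuptools import setup, find_packages

setup(
    name='my-dataflow-job',        # placeholder
    version='0.0.1',               # placeholder
    packages=find_packages(),
    # Ship config.json alongside the installed package on the workers.
    data_files=[('.', ['config.json'])],
    include_package_data=True,     # the missing piece: include the files from MANIFEST.in
    install_requires=['apache-beam[gcp]'],  # placeholder dependency list
)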

Upvotes: 3
