Reputation: 1293
I'm writing a batch Apache Beam job in Python and running it with Google Dataflow. I would like to extract part of my Python code into a .json file and embed it in the Python package, the same way I would do in Java.
I've created a MANIFEST.in file:

include *.json

and also added a data_files entry to setup.py:

data_files=[
    ('.', ['config.json'])
],
When I run both setup.py sdist and setup.py bdist, I can confirm that the file gets included in the package.
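As a quick sanity check, I list the sdist contents and config.json shows up (the archive name below is just an example; it depends on the name and version in setup.py):

import tarfile

# list every member of the sdist whose path ends with config.json
with tarfile.open('dist/my_pipeline-0.1.0.tar.gz') as sdist:  # example archive name
    print([name for name in sdist.getnames() if name.endswith('config.json')])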
I also have code that loads the JSON file:

import json
from pathlib import Path

CONFIG_PATH = Path(__file__).parent / 'config.json'  # config.json sits next to this module inside the package
with CONFIG_PATH.open() as fp:
    config = json.load(fp)
When I run the module using the DirectRunner, the file gets loaded. However, when I submit the job to Dataflow, it fails because it can't find config.json. I added debug logging that traverses the file system, and I can see that the file isn't present in /usr/local/lib/python2.7/dist-packages/ on the worker nodes, where all the required libs are installed.
I've looked through the Beam documentation, including this page: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/, and I can't find a recommended way to package non-Python files with a Beam job so that they are available on the worker nodes.
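For reference, I submit the job roughly like this (project, region and bucket are placeholders); the setup_file option points Dataflow at my setup.py so the package gets built and installed on the workers:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',            # placeholder
    region='us-central1',                # placeholder
    temp_location='gs://my-bucket/tmp',  # placeholder
)
# tell Dataflow to build and install this package on the worker nodes
options.view_as(SetupOptions).setup_file = './setup.py'

with beam.Pipeline(options=options) as pipeline:
    pass  # pipeline definition goes here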
Upvotes: 0
Views: 910
Reputation: 1293
OK, the catch is to add yet another line to setup.py:

include_package_data=True,

Verified it's working.
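For completeness, a minimal sketch of the resulting setup.py (the name and version are placeholders), used together with the MANIFEST.in from the question:

import setuptools

setuptools.setup(
    name='my_pipeline',                    # placeholder
    version='0.1.0',
    packages=setuptools.find_packages(),
    data_files=[
        ('.', ['config.json']),
    ],
    include_package_data=True,             # the missing line
)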
Upvotes: 3