Reputation: 708
I have a .py pipeline using Apache Beam that imports another module (.py), which is my custom module. I have a structure like this:
├── mymain.py
└── myothermodule.py
I import myothermodule.py in mymain.py like this:
import myothermodule
When I run it locally with DirectRunner, I have no problem. But when I run it on Dataflow with DataflowRunner, I get an error that says:
ImportError: No module named myothermodule
So I want to know: what should I do so that this module is found when the job runs on Dataflow?
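For context, a minimal version of mymain.py might look like this (transform_row is a hypothetical helper, not the actual code):

# mymain.py -- minimal sketch of the pipeline described above
import apache_beam as beam

import myothermodule


def run(argv=None):
    with beam.Pipeline(argv=argv) as p:
        (p
         | 'Create' >> beam.Create(['a', 'b', 'c'])
         | 'Apply' >> beam.Map(myothermodule.transform_row))  # hypothetical helper


if __name__ == '__main__':
    run()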
Upvotes: 9
Views: 2903
Reputation: 793
When you run your pipeline remotely, you need to make any dependencies available on the remote workers too.
To do that, put your module in a Python package: place it in a directory containing an __init__.py
file, and create a setup.py. The structure would look like this:
├── mymain.py
├── setup.py
└── othermodules
    ├── __init__.py
    └── myothermodule.py
And import it like this:
from othermodules import myothermodule
Then you can run your pipeline with the command-line option --setup_file ./setup.py
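If you prefer to configure this in code rather than on the command line, the same option can be set through SetupOptions (a sketch, assuming your pipeline builds its options like this):

# Sketch: setting --setup_file programmatically instead of on the command line
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()  # Dataflow options (project, region, ...) would go here too
options.view_as(SetupOptions).setup_file = './setup.py'

with beam.Pipeline(options=options) as p:
    ...  # build your pipeline as usual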
A minimal setup.py file would look like this:
import setuptools
setuptools.setup(packages=setuptools.find_packages())
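If you want the package to carry an explicit name, version, and dependencies (the values below are placeholders), you can expand it to:

import setuptools

setuptools.setup(
    name='othermodules',                     # placeholder package name
    version='0.0.1',                         # placeholder version
    install_requires=[],                     # add any PyPI packages your module needs
    packages=setuptools.find_packages(),
)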
The whole setup is documented in the Apache Beam guide to managing Python pipeline dependencies. A complete example using this approach is the juliaset example in the Beam repository.
Upvotes: 11