Reputation: 1088
My folder structure is as follows:
Project/
--Pipeline.py
--setup.py
--dist/
    --ResumeParserDependencies-0.1.tar.gz
--Dependencies/
    --Module1.py
    --Module2.py
    --Module3.py
My setup.py file looks like this:

from setuptools import setup, find_packages

setup(name='ResumeParserDependencies',
      version='0.1',
      description='Dependencies',
      install_requires=[
          'google-cloud-storage==1.11.0',
          'requests==2.19.1',
          'urllib3==1.23'
      ],
      packages=['Dependencies']
)
I built a source distribution with 'python setup.py sdist', which produced dist/ResumeParserDependencies-0.1.tar.gz. I then set the following in my pipeline options:
setup_options.extra_packages = ['./dist/ResumeParserDependencies-0.1.tar.gz']
However, once I run my pipeline on Dataflow, I get the error 'No module named ResumeParserDependencies'. If I use 'pip install ResumeParserDependencies-0.1.tar.gz' locally, the package installs, and I can see it using 'pip freeze'.
What am I missing to load the package into Dataflow?
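One way to narrow this kind of problem down is to list what the tar.gz actually contains before handing it to Dataflow. The following is a minimal sketch using only the standard library. The helper names (list_sdist_members, build_demo_archive) and the demo archive contents are hypothetical stand-ins so that the example runs on its own; with a real build you would point list_sdist_members at the file in dist/.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def list_sdist_members(archive_path):
    """Return the sorted names of the regular files stored in a .tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def build_demo_archive(directory):
    """Create a tiny stand-in archive (hypothetical contents, for the demo only)."""
    archive_path = Path(directory) / "ResumeParserDependencies-0.1.tar.gz"
    files = {
        "ResumeParserDependencies-0.1/setup.py": b"# setup script\n",
        "ResumeParserDependencies-0.1/Dependencies/Module1.py": b"# module\n",
    }
    with tarfile.open(archive_path, "w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return archive_path


# Build the demo archive in a temporary folder and list what it holds.
with tempfile.TemporaryDirectory() as tmp:
    demo_members = list_sdist_members(build_demo_archive(tmp))

for name in demo_members:
    print(name)
```

The folder names printed under the top-level directory are what can be imported once the archive is installed. The archive name itself (here ResumeParserDependencies) is the distribution name, which is not necessarily an importable module.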
Upvotes: 6
Views: 8640
Reputation: 1088
I changed my folder structure and got this to work:
Project/
--Pipeline.py
--setup.py
--Module1/
    --__init__.py
--Module2/
    --__init__.py
--Module3/
    --__init__.py
The setup.py file now looks like this:

from setuptools import setup, find_packages

setup(name='ResumeParserDependencies',
      version='0.1',
      description='Dependencies',
      install_requires=[
          'google-cloud-storage==1.11.0',
          'urllib3==1.23'
      ],
      packages=find_packages()
)
In my pipeline, I specified:
setup_options.setup_file = './setup.py'
And I didn't need:
setup_options.extra_packages = ['./dist/ResumeParserDependencies-0.1.tar.gz']
Reference: find_packages doesn't find my Python file
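The reason the new layout matters is that find_packages() only reports directories that contain an __init__.py. The sketch below is a simplified imitation of that rule written with the standard library, not the setuptools implementation itself; find_regular_packages is a hypothetical helper, and the temporary folders only mimic the two layouts discussed here.

```python
import os
import tempfile


def find_regular_packages(root):
    """Simplified imitation of the find_packages() rule: a directory counts as
    a package only if it holds an __init__.py, and the search does not descend
    into directories that are not packages."""
    found = []

    def walk(directory, prefix):
        for entry in sorted(os.listdir(directory)):
            path = os.path.join(directory, entry)
            if not os.path.isdir(path):
                continue
            if not os.path.isfile(os.path.join(path, "__init__.py")):
                continue
            name = prefix + entry
            found.append(name)
            walk(path, name + ".")

    walk(root, "")
    return found


with tempfile.TemporaryDirectory() as project:
    # Old layout: a plain folder of modules without __init__.py.
    os.makedirs(os.path.join(project, "Dependencies"))
    open(os.path.join(project, "Dependencies", "Module1.py"), "w").close()
    # New layout: one folder per module, each with an __init__.py.
    for package in ("Module1", "Module2", "Module3"):
        os.makedirs(os.path.join(project, package))
        open(os.path.join(project, package, "__init__.py"), "w").close()
    packages = find_regular_packages(project)

print(packages)  # ['Module1', 'Module2', 'Module3']
```

The Dependencies folder is skipped because it has no __init__.py, while the three new folders are picked up.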
Upvotes: 14
Reputation: 1452
This issue is usually caused by a version mismatch in either the SDK or the worker dependencies. Check the SDK version your pipeline uses against the worker dependencies listed for that SDK version, and verify that the versions you are running are compatible.
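As a rough starting point, the locally installed versions can be compared against the versions you pin. This is only a sketch: it inspects the local environment, not the Dataflow workers, and the helper names (installed_version, compare_with_pins) and the PINNED mapping are assumptions for illustration, with the pins copied from the setup.py in the other answer.

```python
from importlib import metadata

# Pins copied from the setup.py shown in this thread (adjust to your own).
PINNED = {"google-cloud-storage": "1.11.0", "urllib3": "1.23"}


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def compare_with_pins(pins):
    """Map each distribution to (pinned, installed, whether they match)."""
    report = {}
    for name, wanted in pins.items():
        found = installed_version(name)
        report[name] = (wanted, found, found == wanted)
    return report


print("apache-beam:", installed_version("apache-beam"))
for name, (wanted, found, ok) in compare_with_pins(PINNED).items():
    print(f"{name}: pinned {wanted}, installed {found}, match={ok}")
```

Any line reporting match=False, or an SDK version that differs from the one the workers run, is worth checking against the worker dependency list for that SDK version.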
Upvotes: 0