Melissa Guo

Reputation: 1088

How can I install a python package onto Google Dataflow and import it into my pipeline?

My folder structure is as follows:

Project/
 --Pipeline.py
 --setup.py
 --dist/
  --ResumeParserDependencies-0.1.tar.gz
 --Dependencies/
        --Module1.py
        --Module2.py
        --Module3.py

My setup.py file looks like this:

from setuptools import setup, find_packages

setup(name='ResumeParserDependencies',
  version='0.1',
  description='Dependencies',
  install_requires=[
   'google-cloud-storage==1.11.0',
   'requests==2.19.1',
   'urllib3==1.23'
    ],
  packages = ['Dependencies']
 )

I used the setup.py file to create a tar.gz file with 'python setup.py sdist'. The tarball lands in the dist folder as ResumeParserDependencies-0.1.tar.gz. In my pipeline options, I then specified:

setup_options.extra_packages = ['./dist/ResumeParserDependencies-0.1.tar.gz']

However, once I run my pipeline on Dataflow, I get the error 'No module named ResumeParserDependencies'. If I use 'pip install ResumeParserDependencies-0.1.tar.gz' locally, the package installs, and I can see it using 'pip freeze'.


What am I missing to load the package into Dataflow?

Upvotes: 6

Views: 8640

Answers (2)

Melissa Guo

Reputation: 1088

I changed my folder structure and got this to work:

Project/
--Pipeline.py
--setup.py
--Module1/
    --__init__.py
--Module2/
    --__init__.py
--Module3/
    --__init__.py

The setup.py file now looks like this:

from setuptools import setup, find_packages

setup(name='ResumeParserDependencies',
  version='0.1',
  description='Dependencies',
  install_requires=[
   'google-cloud-storage==1.11.0',
   'urllib3==1.23'
    ],
  packages = find_packages()
 )

In my pipeline, I specified:

setup_options.setup_file = './setup.py'

And I didn't need:

setup_options.extra_packages = ['./dist/ResumeParserDependencies-0.1.tar.gz']

Reference: find_packages doesn't find my Python file

Upvotes: 14

Héctor Neri

Reputation: 1452

This issue usually stems from a version mismatch between the SDK and the worker dependencies. To resolve it, check your Dataflow SDK version against the Worker Dependencies list to verify that you're running compatible versions.

Upvotes: 0
