pavan

Reputation: 1

GCP DataflowRunner ImportErrors

My pipeline works with the DirectRunner, but I get import errors when I switch to the DataflowRunner because the lxml module is not found. Passing the setuptools code below alongside the main code (--setup_file setup.py) still does not work.

setuptools.setup(
    name='lxml',
    version='4.2.5',
    install_requires=[],
    packages= setuptools.find_packages(),
)

Error: ImportError: No module named lxml [while running 'Run Query']

Any help/suggestions to overcome this error? Thanks.

Upvotes: 0

Views: 165

Answers (1)

jagthebeetle

Reputation: 715

The name you pass to the setuptools.setup function is the name of your own package, not of a dependency. Dependencies such as lxml should be listed in the install_requires argument. I would imagine it works with the DirectRunner because lxml is already installed on your local machine.

The Beam juliaset example provides a sample setup.py file:

import setuptools

REQUIRED_PACKAGES = ['numpy']
setuptools.setup(
    name='juliaset',  # this is their package name
    version='0.0.1',
    description='Julia set workflow package.',
    install_requires=REQUIRED_PACKAGES,
    ...)
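Applied to the question, a corrected setup.py might look roughly like the sketch below. The package name my_pipeline is a hypothetical placeholder for your own package, and the lxml pin simply reuses the 4.2.5 version from the question, so adjust both to your project.

```python
import setuptools

# Third-party packages that the Dataflow workers must install from PyPI.
# The 4.2.5 pin is taken from the version in the question; change as needed.
REQUIRED_PACKAGES = ['lxml==4.2.5']

setuptools.setup(
    name='my_pipeline',  # hypothetical: the name of YOUR package, not 'lxml'
    version='0.0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
```

The pipeline is then launched with --setup_file setup.py as before.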

PyPI dependencies

If lxml is your only dependency, or all your dependencies are on PyPI, you should be able to use the much simpler requirements.txt file. In general, the setup.py approach requires much more boilerplate.

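To use requirements.txt, freeze your dependencies (pip freeze lists every package installed in the current environment, so it is best run from a clean virtual environment):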

pip freeze > requirements.txt

And pass the requirements.txt file to your pipeline:

--requirements_file requirements.txt
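If you assemble the pipeline flags in code rather than on the command line, the same option can be added there. The helper below is only an illustrative sketch: build_dataflow_args and the project, region, and bucket values are made-up names, while the flags themselves are the standard Beam pipeline options.

```python
def build_dataflow_args(project, region, temp_location,
                        requirements_file=None, setup_file=None):
    """Return a list of Beam pipeline flags for a Dataflow run.

    Hypothetical helper for illustration; it only builds strings.
    """
    args = [
        '--runner=DataflowRunner',
        '--project=' + project,
        '--region=' + region,
        '--temp_location=' + temp_location,
    ]
    # Only one of the two dependency mechanisms is usually needed.
    if requirements_file:
        args.append('--requirements_file=' + requirements_file)
    if setup_file:
        args.append('--setup_file=' + setup_file)
    return args


# Placeholder values; substitute your own project, region, and bucket.
args = build_dataflow_args(
    'my-project', 'us-central1', 'gs://my-bucket/tmp',
    requirements_file='requirements.txt',
)
```

The resulting list can be passed to apache_beam.options.pipeline_options.PipelineOptions(args) when constructing the pipeline.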

See also the Beam documentation's page for various dependency patterns for Python.

Upvotes: 1
