Kreender

Reputation: 294

Cloud Dataflow Python 3 job not resolving dependencies

I have a simple Apache Beam project using Python 3 that transforms some data and writes it to BigQuery. It uses a package called textstat. If I run it locally everything works, but when I run it on Dataflow I get the following error:

NameError: name 'textstat' is not defined [while running 'generatedPtransform-441']

This is my current setup.py file:

import setuptools

REQUIRED_PACKAGES = ['textstat==0.5.6']
PACKAGE_NAME = 'my_package'
PACKAGE_VERSION = '0.0.1'


setuptools.setup(
    name=PACKAGE_NAME,
    version=PACKAGE_VERSION,
    description='Example project',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)

and these are my pipeline args:

pipeline_args = [
    '--project={}'.format('etl-example'),
    '--runner={}'.format('Dataflow'),
    '--temp_location=gs://dataflowtemporal/',
    '--setup_file=./setup.py',
]

and I run it like this:

pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(StandardOptions).streaming = True
pipeline = beam.Pipeline(options=pipeline_options)
...
pipeline.run()

I also tried running this in the terminal before running the job:

python setup.py sdist --formats=gztar

but I get the same result of textstat not being found. Another thing I tried was dropping setup.py and using only the argument

--requirements_file=./requirements.txt

But again, textstat is not found.

At this point I don't know what else to try.

Upvotes: 0

Views: 283

Answers (1)

Yichi Zhang

Reputation: 381

Normally this happens because the library is not imported locally inside your DoFn.
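
For example, a minimal sketch (the DoFn and field names here are just illustrative, not taken from your pipeline) that imports textstat inside the process method, so the import runs on the Dataflow worker rather than only in the launching process:

import apache_beam as beam

class ComputeReadability(beam.DoFn):
    def process(self, element):
        # Importing inside process() means the dependency is resolved on the
        # worker executing the DoFn, not only on the machine that submitted the job.
        import textstat
        element['readability'] = textstat.flesch_reading_ease(element['text'])
        yield element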

Alternatively, you can try the --save_main_session option as mentioned in https://cloud.google.com/dataflow/docs/resources/faq
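
With your existing pipeline_args, that would look roughly like this (a sketch; only the extra flag is new):

pipeline_args = [
    '--project={}'.format('etl-example'),
    '--runner={}'.format('Dataflow'),
    '--temp_location=gs://dataflowtemporal/',
    '--setup_file=./setup.py',
    # Pickle the main session so module-level imports (like textstat)
    # are available on the workers.
    '--save_main_session',
]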

Upvotes: 2
