Reputation: 294
I have a simple Apache Beam project in Python 3 that transforms some data and writes it to BigQuery. It uses a package called textstat. Everything works when I run it locally, but when I run it on Dataflow I get the following error:
NameError: name 'textstat' is not defined [while running 'generatedPtransform-441']
This is my current setup.py file:
import setuptools
REQUIRED_PACKAGES = ['textstat==0.5.6']
PACKAGE_NAME = 'my_package'
PACKAGE_VERSION = '0.0.1'
setuptools.setup(
    name=PACKAGE_NAME,
    version=PACKAGE_VERSION,
    description='Example project',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
)
and these are my pipeline arguments:
pipeline_args = [
    '--project={}'.format('etl-example'),
    '--runner={}'.format('Dataflow'),
    '--temp_location=gs://dataflowtemporal/',
    '--setup_file=./setup.py',
]
and I run it like this:
pipeline_options = PipelineOptions(pipeline_args)
pipeline_options.view_as(StandardOptions).streaming = True
pipeline = beam.Pipeline(options=pipeline_options)
...
pipeline.run()
I also tried running this in the terminal before launching the job:
python setup.py sdist --formats=gztar
but I get the same result: textstat is not found. Another thing I tried was dropping setup.py and passing only this argument:
--requirements_file=./requirements.txt
But again, textstat is not found.
At this point I don't know what else to try.
Upvotes: 0
Views: 283
Reputation: 381
This usually happens because the library is imported at the top of the module rather than locally inside your DoFn, so the name is not defined on the Dataflow workers.
Alternatively, you can try the --save_main_session option, as described in https://cloud.google.com/dataflow/docs/resources/faq
Upvotes: 2