Reputation: 2873
Objective- I have a Dataflow template (written in Python) that depends on pandas and nltk, and I want to trigger the Dataflow job from a Cloud Function. For this purpose, I have uploaded the code to a bucket and am ready to specify the template location in the Cloud Function.
Problem- How do I pass the requirements_file parameter (the one you would normally pass to install third-party libraries) when triggering a Dataflow job from a Cloud Function using the Google API discovery module?
Prerequisites- I know this can be done when launching a job from a local machine by specifying a local directory path, but when I try to specify a GCS path such as --requirements_file gs://bucket/requirements.txt
it gives me an error saying:
The file gs://bucket/requirements.txt cannot be found. It was specified in the --requirements_file command line option.
Upvotes: 4
Views: 2842
Reputation: 206
A Dataflow template is not Python or Java code; it is a compiled version of the code you've written in Python or Java. So, when you create your template, you can pass your requirements.txt
in the arguments like you normally do, as shown below:
python dataflow-using-cf.py \
--runner DataflowRunner \
--project <PROJECT_ID> \
--staging_location gs://<BUCKET_NAME>/staging \
--temp_location gs://<BUCKET_NAME>/temp \
--template_location ./template1 \
--requirements_file ./requirements.txt
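Here requirements.txt is an ordinary pip requirements file; for the pandas and nltk dependencies mentioned in the question it could be as simple as:
pandas
nltk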
The above command will create a file named template1
which, if you open it, contains a JSON structure. This file is the compiled version of the Dataflow code you've written; during compilation, your requirements.txt
is read from your local directory and the pipeline steps are compiled in. You can then upload the template to a bucket and provide its path to the Cloud Function; you don't have to worry about the requirements.txt
file after the template has been created.
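From the Cloud Function, the staged template can then be launched with the discovery client roughly like this (the project ID, template path, and job name are placeholders, and the function is assumed to run with credentials allowed to start Dataflow jobs):
import googleapiclient.discovery

def trigger_dataflow(event, context):
    # Entry point of a background-triggered Cloud Function
    # (the signature depends on your trigger type).
    # Build a client for the Dataflow REST API (v1b3); cache_discovery=False
    # avoids the file-cache warning in the Cloud Functions runtime.
    dataflow = googleapiclient.discovery.build("dataflow", "v1b3", cache_discovery=False)
    # Launch a job from the compiled template staged in GCS.
    request = dataflow.projects().templates().launch(
        projectId="<PROJECT_ID>",
        gcsPath="gs://<BUCKET_NAME>/templates/template1",
        body={"jobName": "job-from-cloud-function"},
    )
    response = request.execute()
    print(response)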
Upvotes: 4