Reputation: 2873
Objective- I have a Dataflow template (written in Python) that depends on pandas and nltk, and I want to trigger the Dataflow job from a Cloud Function. For this purpose, I have uploaded the code to a bucket and am ready to specify the template location in the Cloud Function.
Problem- How do I pass the requirements_file parameter (the one you would normally pass to install third-party libraries) when triggering a Dataflow job from a Cloud Function using the Google API discovery module?
Prerequisites- I know this can be done when launching a job from a local machine by specifying a local directory path, but when I try to specify a GCS path such as --requirements_file gs://bucket/requirements.txt
it gives me an error saying:
The file gs://bucket/requirements.txt cannot be found. It was specified in the --requirements_file command line option.
Upvotes: 4
Views: 2842
Reputation: 206
A Dataflow template is not Python or Java code; it is a compiled version of the code you've written in Python or Java. So, when you create your template, you can pass your requirements.txt
in the arguments like you normally do, as shown below:
python dataflow-using-cf.py \
--runner DataflowRunner \
--project <PROJECT_ID> \
--staging_location gs://<BUCKET_NAME>/staging \
--temp_location gs://<BUCKET_NAME>/temp \
--template_location ./template1 \
--requirements_file ./requirements.txt
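Here requirements.txt is an ordinary pip requirements file; for the pandas and nltk dependencies mentioned in the question it could be as simple as:
pandas
nltk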
The above command will create a file named template1
which, if you open it, contains a JSON structure. This file is the compiled version of the Dataflow code you've written; during compilation, your requirements.txt
is read from your local directory and the pipeline steps are compiled in. You can then upload the template to a bucket and provide its path to the Cloud Function; you don't have to worry about the requirements.txt
file after the template has been created.
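From the Cloud Function, the staged template can then be launched with the discovery client roughly like this (the project ID, template path, and job name are placeholders, and the function is assumed to run with credentials allowed to start Dataflow jobs):
import googleapiclient.discovery

def trigger_dataflow(event, context):
    # Entry point of a background-triggered Cloud Function
    # (the signature depends on your trigger type).
    # Build a client for the Dataflow REST API (v1b3); cache_discovery=False
    # avoids the file-cache warning in the Cloud Functions runtime.
    dataflow = googleapiclient.discovery.build("dataflow", "v1b3", cache_discovery=False)
    # Launch a job from the compiled template staged in GCS.
    request = dataflow.projects().templates().launch(
        projectId="<PROJECT_ID>",
        gcsPath="gs://<BUCKET_NAME>/templates/template1",
        body={"jobName": "job-from-cloud-function"},
    )
    response = request.execute()
    print(response)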
Upvotes: 4