srinidhi

Reputation: 11

How to launch a Cloud Dataflow pipeline from a Google Cloud Function when a particular set of files reaches Cloud Storage

I have a requirement to create a Cloud Function that checks for a set of files in a GCS bucket; only when all of those files have arrived in the bucket should it launch the Dataflow templates for those files.

My existing Cloud Function code launches a Cloud Dataflow job for each file that arrives in the GCS bucket, running a different Dataflow template for each file based on its naming convention. This existing code works fine, but my intention is not to trigger Dataflow for each uploaded file directly. It should check for the set of files and, only once all of the files have arrived, launch the Dataflow jobs for those files.

Is there a way to do this using Cloud Functions, or is there an alternative way of achieving the desired result?

from googleapiclient.discovery import build
import time

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address',
        'Customer_service_ticket'
    ]

    # Check the uploaded file and run related dataflow jobs.
    for i in filesnames:
        if 'inbound/{}'.format(i) in file['name']:
            print("Processing file: {filename}".format(filename=file['name']))

            project = 'xxx'
            inputfile = 'gs://xxx/inbound/' + file['name']
            job = 'df_load_wave1_{}'.format(i)
            template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
            location = 'asia-south1'

            dataflow = build('dataflow', 'v1b3', cache_discovery=False)
            request = dataflow.projects().locations().templates().launch(
                projectId=project,
                gcsPath=template,
                location=location,
                body={
                    'jobName': job,
                    'environment': {
                        'workerRegion': 'asia-south1',
                        'tempLocation': 'gs://xxx/temp'
                    }
                }
            )

            # Execute the dataflow job
            response = request.execute()

            job_id = response["job"]["id"]

I've written the code below for this functionality. The Cloud Function runs without any error, but it doesn't trigger any Dataflow job. I'm not sure what is happening, as the logs show no errors.

from googleapiclient.discovery import build
import time
import os

def df_load_function(file, context):
    filesnames = [
        'Customer_',
        'Customer_Address_',
        'Customer_service_ticket_'
    ]
    paths = ['Customer_', 'Customer_Address_', 'Customer_service_ticket_']

    for path in paths:
        if os.path.exists('gs://xxx/inbound/') == True:
            # Check the uploaded file and run related dataflow jobs.
            for i in filesnames:
                if 'inbound/{}'.format(i) in file['name']:
                    print("Processing file: {filename}".format(filename=file['name']))

                    project = 'xxx'
                    inputfile = 'gs://xxx/inbound/' + file['name']
                    job = 'df_load_wave1_{}'.format(i)
                    template = 'gs://xxx/template/df_load_wave1_{}'.format(i)
                    location = 'asia-south1'

                    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
                    request = dataflow.projects().locations().templates().launch(
                        projectId=project,
                        gcsPath=template,
                        location=location,
                        body={
                            'jobName': job,
                            'environment': {
                                'workerRegion': 'asia-south1',
                                'tempLocation': 'gs://xxx/temp'
                            }
                        }
                    )

                    # Execute the dataflow job
                    response = request.execute()

                    job_id = response["job"]["id"]

                else:
                    exit()

Could someone please help me with the above Python code? Also, my file names have the current date at the end, as these are incremental files that I receive from different source teams.

Upvotes: 0

Views: 614

Answers (1)

Patb

Reputation: 46

If I'm understanding your question correctly, the easiest thing to do is to write basic logic in your function that determines whether the entire set of files is present. If not, exit the function. If yes, run the appropriate Dataflow pipeline. Basically, implement what you wrote in your first paragraph as Python code; a rough sketch of that shape follows.
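To make that concrete, the overall shape of the function can be as small as the sketch below. Here all_files_present and launch_dataflow_jobs are hypothetical helper names standing in for the completeness check and for the template-launch code you already have:

def df_load_function(file, context):
    # all_files_present() and launch_dataflow_jobs() are hypothetical helpers:
    # the first returns True only once every expected file is in the bucket,
    # the second wraps the templates().launch() calls from your existing code.
    if not all_files_present():
        return  # set incomplete: exit without launching anything
    launch_dataflow_jobs()  # set complete: run the appropriate Dataflow pipelines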

If it's a small set of files, it shouldn't be an issue to have a function run on each upload to check set completeness. Even if it's, for example, 10,000 files a month, the cost is extremely small for this service, assuming:

  1. Your function isn't using lots of bandwidth to transfer data
  2. The code for each function invocation doesn't take a long time to run.

Even in scenarios where you can't meet these requirements, Cloud Functions is still pretty cheap to run.

If you're worried about costs I would recommend checking out the Google Cloud Pricing Calculator to get an estimate.

Edit with updated code:

I would highly recommend using the Google Cloud Storage Python client library for this. Using os.path likely won't work as there are additional underlying steps required to search a bucket...and probably more technical details there than I fully understand.

To use the Python client library, add google-cloud-storage to your requirements.txt. Then, use something like the following code to check the existence of an object. This example is based off an HTTP trigger, but the gist of the code to check object existence is the same.

from google.cloud import storage

def hello_world(request):
    # Instantiate GCS client
    client = storage.Client()

    # Instantiate bucket definition
    bucket = client.bucket("bucket-name")

    # Search for each expected object (filenames is your list of expected object names)
    for file in filenames:
        if bucket.blob(file).exists() and "name_modifier" in file:
            pass  # Run name_modifier Dataflow job here
        elif bucket.blob(file).exists() and "name_modifier_2" in file:
            pass  # Run name_modifier_2 Dataflow job here
        else:
            return "File not found"

This code isn't exactly what you want from a logic standpoint, but it should get you started. You'll probably want to first make sure all of the objects can be found, and only then move to a second step where you launch the corresponding Dataflow job for each file; a sketch of that structure follows.
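For example, a sketch of that two-step structure could look like the following. It reuses the 'xxx' placeholder project, bucket, and template paths from your question, and assumes prefix matching is enough to identify the date-suffixed files; treat it as untested starting-point code, not a drop-in solution:

from google.cloud import storage
from googleapiclient.discovery import build

PREFIXES = ['Customer_', 'Customer_Address_', 'Customer_service_ticket_']

def df_load_function(file, context):
    client = storage.Client()

    # Step 1: list what is currently in the inbound folder and check that every
    # expected prefix has at least one matching object (prefix match, because the
    # real file names carry a date suffix).
    present = [blob.name for blob in client.list_blobs('xxx', prefix='inbound/')]
    if not all(any(name.startswith('inbound/' + p) for name in present) for p in PREFIXES):
        return  # the set is not complete yet; wait for the next upload event

    # Step 2: the whole set has arrived, so launch the template for each expected
    # file, reusing the launch call from the question.
    dataflow = build('dataflow', 'v1b3', cache_discovery=False)
    for p in PREFIXES:
        dataflow.projects().locations().templates().launch(
            projectId='xxx',
            gcsPath='gs://xxx/template/df_load_wave1_{}'.format(p),
            location='asia-south1',
            body={
                'jobName': 'df_load_wave1_{}'.format(p),
                'environment': {
                    'workerRegion': 'asia-south1',
                    'tempLocation': 'gs://xxx/temp'
                }
            }
        ).execute()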

Upvotes: 1
