Reputation: 15
I have the following Lambda function that runs my script in AWS Glue whenever a new object is detected in an S3 bucket.
import json
import boto3
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    bucketName = event["Records"][0]["s3"]["bucket"]["name"]
    fileNameFull = event["Records"][0]["s3"]["object"]["key"]
    fileName = unquote_plus(fileNameFull)
    print(bucketName, fileName)  # was print(bucket, fileName): NameError, 'bucket' is undefined
    glue = boto3.client('glue')
    response = glue.start_job_run(
        JobName='My_Job_Glue',
        Arguments={
            '--s3_target_path_key': fileName,
            '--s3_target_path_bucket': bucketName
        }
    )
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
At first glance it works well, and I'm partially getting what I need. The problem is that more than one file always arrives for the same bucket, and this logic starts a Glue job run for each occurrence (if 3 files arrive, I get 3 Glue job runs). How could I improve my function so that the script runs only once all the new data has been identified? Today I have Kafka Connect configured to batch 5000 records; if that batch isn't filled within a few minutes, it flushes however many records it has.
Upvotes: 0
Views: 702
Reputation: 852
S3 can send its event notifications to a Simple Queue Service (SQS) queue instead of invoking Lambda directly, and the SQS queue can then trigger your Lambda function. With SQS as the trigger, messages are delivered to Lambda in batches (configurable batch size and batching window), so a single invocation can handle several new objects and start one Glue job run for all of them.
Upvotes: 1