Reputation: 15
I have the following Lambda function that runs my script in AWS Glue whenever a new object is detected in an S3 bucket.
import json
import boto3
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    bucketName = event["Records"][0]["s3"]["bucket"]["name"]
    fileNameFull = event["Records"][0]["s3"]["object"]["key"]
    fileName = unquote_plus(fileNameFull)
    print(bucketName, fileName)  # was print(bucket, fileName): NameError, 'bucket' is undefined
    glue = boto3.client('glue')
    response = glue.start_job_run(
        JobName='My_Job_Glue',
        Arguments={
            '--s3_target_path_key': fileName,
            '--s3_target_path_bucket': bucketName
        }
    )
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!')
    }
At first glance it works well, and I'm partially getting what I need. The problem is that more than one file always arrives for the same bucket, and this logic starts a Glue job run for each occurrence (if 3 files arrive, I get 3 Glue job runs). How could I improve my function so that the script runs only once all the new data has been identified? Today I have Kafka Connect configured to batch 5000 records; if that batch isn't filled within a few minutes, it flushes however many records it has.
Upvotes: 0
Views: 702
Reputation: 852
S3 can send its event notifications to a Simple Queue Service (SQS) queue instead of invoking Lambda directly, and the SQS queue can then trigger your Lambda function. With SQS as the trigger, messages are delivered to Lambda in batches (configurable batch size and batching window), so a single invocation can handle several new objects and start one Glue job run for all of them.
Upvotes: 1