splitting and merging the json files from batch jobs in aws

Question

I am working on a project where I am splitting a single file with a bunch of sentences into chunks for further sending to a third-party API for sentiment analysis.

The third-party API has a limitation of up to 5000 characters of limitation and which is why I am splitting the file into chunks of 40 sentences each. Each chunk will be sent to a batch job via AWS SQS and processed for sentiment analysis from a third-party API. I wanted to merge all of the processed files into one file. I couldn't find the logic to merge the files.

For example,

the input file,

chunk1: sentence1....sentence1... sentence1....

chunk2: sentence2....sentence2... sentence2....

The input file is separated into chunks. Each of these chunks is sent separately to a batch job via SQS. The batch job will call the external API for sentiment analysis. Each file will be uploaded to the S3 bucket as separate files. Output file:

{"Chunk1": "sentence1....sentence1...sentence1....",
"Sentiment": "positive."}

All I wanted is to have the output in a single file but couldn't find the logic to merge the output files.

Logic I tried:

For each input file, I send a UUID to every chunk as ametadata and merge them with another lambda function. But the problem here is I am not sure when all of the chunks are processed and when to invoke the lambda function to merge the files.

If you have any better logic to merge the files, please share it here.

splitting and merging the json files from batch jobs in aws

Answers (1)

Related Questions