Merging pdf files stored on Amazon S3

Question

Currently I download all my PDF files to my server and then use pdfbox to merge them together. It works perfectly fine, but it's very slow, since I have to download every file first.

Is there a way to perform all of this on S3 directly? I've been looking for a way to do it, in Java or even in Python, but haven't been able to find one.

I read the following:

Merging files on S3 Amazon

https://github.com/boazsegev/combine_pdf/issues/18

Is there a way to merge files stored in S3 without having to download them?

EDIT

I ended up using concurrent.futures.ThreadPoolExecutor with a maximum of 8 worker threads to download all the PDF files from S3 in parallel.

Once all files were downloaded I merged them with pdfbox. Simple.
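
For anyone landing here later, the parallel download step might look roughly like the sketch below. This is not my exact code: download_all, make_s3_fetch and the numbered file naming are made up for illustration, and the only boto3 call it relies on is the client's download_file(Bucket, Key, Filename). The fetch function is passed in so the threading part can be tried without AWS credentials.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def download_all(keys, fetch, dest_dir, max_workers=8):
    """Download every key concurrently; return local paths in the same order as keys."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def _one(indexed_key):
        index, key = indexed_key
        # Prefix with the index so equal file names under different prefixes don't collide.
        target = dest / f"{index:04d}_{Path(key).name}"
        fetch(key, str(target))
        return target

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, which keeps the merge order stable.
        return list(pool.map(_one, enumerate(keys)))


def make_s3_fetch(bucket):
    """Hypothetical helper: build a fetch(key, filename) function backed by boto3."""
    import boto3  # imported lazily so download_all can be used without boto3 installed

    client = boto3.client("s3")

    def fetch(key, filename):
        client.download_file(bucket, key, filename)

    return fetch
```

Usage would be something like download_all(keys, make_s3_fetch("my-bucket"), "/tmp/pdfs"), where the bucket name and directory are placeholders. The returned paths can then be handed to pdfbox (or any other merger) in order.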

Bruce P · Accepted Answer

S3 is just a data store, so at some level you need to transfer the PDF files from S3 to a server, merge them there, and transfer the result back. You'll probably get the best speed by doing the merge on an EC2 instance located in the same region as your S3 bucket.

If you don't want to spin up an EC2 instance yourself just to do this, another alternative may be AWS Lambda, a compute service where you upload your code and AWS manages its execution.
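
As a rough illustration of the Lambda route, here is a minimal Python sketch of the download, merge, upload cycle. Several things in it are assumptions rather than anything from the question: the event fields (bucket, keys, output_key) are invented, the function names are hypothetical, and it uses the pypdf library in place of pdfbox, so pypdf would have to be bundled with the function. The S3 client and the merge function are passed in so the flow can be exercised locally.

```python
import os
import tempfile


def merge_from_s3(s3, merge, bucket, keys, output_key):
    """Download each key, merge the local copies, upload the result, return its key."""
    with tempfile.TemporaryDirectory() as workdir:
        local_paths = []
        for index, key in enumerate(keys):
            path = os.path.join(workdir, f"{index:04d}.pdf")
            s3.download_file(bucket, key, path)
            local_paths.append(path)

        merged_path = os.path.join(workdir, "merged.pdf")
        merge(local_paths, merged_path)
        s3.upload_file(merged_path, bucket, output_key)
    return output_key


def merge_with_pypdf(paths, out_path):
    """Concatenate PDFs in order using pypdf (a stand-in for pdfbox here)."""
    from pypdf import PdfWriter  # third-party; imported lazily

    writer = PdfWriter()
    for path in paths:
        writer.append(path)
    writer.write(out_path)
    writer.close()


def lambda_handler(event, context):
    """Hypothetical event shape: {"bucket": ..., "keys": [...], "output_key": ...}."""
    import boto3  # available in the AWS Lambda Python runtime

    s3 = boto3.client("s3")
    key = merge_from_s3(
        s3, merge_with_pypdf, event["bucket"], event["keys"], event["output_key"]
    )
    return {"merged_key": key}
```

The files still travel from S3 to the function and back, as described above; the gain is only that the transfer happens inside AWS rather than to your own server.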
