Monitor and navigate S3 bucket for new files added by users

Question

I have a Rails app that catalogues recorded music products with metadata & wav files.

Previously, my users had the option to send me files via ftp, which i'd monitor with a cron task for new .complete files and then pick it's associated .xml file and a perform metadata import and audio file transfer to S3.

I regularly hit capacity limits on the prior FTP so decided to move the user 'dropbox' to S3, with an FTP gateway to allow users to send me their files. Now it's on S3 and due to S3 not storing the object in folders i'm struggling to get my head around how to navigate the bucket, find the .complete files and then perform my imports as usual.

Can anyway recommend how to 'scan' a bucket for new .complete files.....read the filename and then pass back to my app so that I can then pick up it's xml, wav and jpg files?

The structure of the files in my bucket is like this. As you can see there are two products here. I would need to find both and import their associated xml data and wavs/jpg

42093156-5060156655634/
42093156-5060156655634/5060156655634.complete
42093156-5060156655634/5060156655634.jpg
42093156-5060156655634/5060156655634.xml
42093156-5060156655634/5060156655634_1_01_wav.wav
42093156-5060156655634/5060156655634_1_02_wav.wav
42093156-5060156655634/5060156655634_1_03_wav.wav
42093156-5060156655634/5060156655634_1_04_wav.wav
42093156-5060156655634/5060156655634_1_05_wav.wav
42093156-5060156655634/5060156655634_1_06_wav.wav
42093156-5060156655634/5060156655634_1_07_wav.wav
42093156-5060156655634/5060156655634_1_08_wav.wav
42093156-5060156655634/5060156655634_1_09_wav.wav
42093156-5060156655634/5060156655634_1_10_wav.wav
42093156-5060156655634/5060156655634_1_11_wav.wav
42093163-5060243322593/
42093163-5060243322593/5060243322593.complete
42093163-5060243322593/5060243322593.jpg
42093163-5060243322593/5060243322593.xml
42093163-5060243322593/5060243322593_1_01_wav.wav

Bruno Reis · Accepted Answer

Though Amazon S3 does not formally have the concept of folders, you can actually simulate folders through the GET Bucket API, using the delimiter and prefix parameters. You'd get a result similar to what you see in the AWS Management Console interface.

Using this, you could list the top-level directories, and scan through them. After finding the names of the top-level directories, you could change the parameters and issue a new GET Bucket request, to list the "files" inside the "directory", and check for the existence of the .complete file as well as your .xml and other relevant files.

However, there might be a different approach to your problem: did you consider using SQS? You could make the process that receives the uploads post a message to a queue in SQS, say, completed-uploads, with the name of the folder of the upload that just completed. Another process would then consume the queue and process the finished uploads. No need to scan through the directories in S3.

Just note that, if you try the SQS approach, you might need to be prepared for the possibility of being notified more than once of a finished upload: SQS guarantees that it will eventually deliver posted messages at least once; you might receive duplicated messages! (you can identify a duplicated message by saving the id of the received message on, say, a consistent database, and checking newly received messages against the same database).

Also, remember that, if you use the US Standard Region for S3, then you don't have read-after-write consistency, you have only eventual-consistency, which means that the process receiving messages from SQS might try to GET the object from S3 and get nothing back -- just try again until it sees the object.

Monitor and navigate S3 bucket for new files added by users

Answers (1)

Related Questions