Henry

Reputation: 3457

Uploading files to ec2, first to ebs volume then moving to s3

http://farm8.staticflickr.com/7020/6702134377_cf70482470_z.jpg

OK, sorry for the terrible drawing, but it seemed a better way to organize my thoughts and convey them. I have been wrestling for a while with how to create an optimal, decoupled, easily scalable system for uploading files to a web app on AWS.

Uploading directly to S3 would work except that the files need to be instantly accessible to the uploader for manipulation; once manipulated, they can go to S3, where they will be served to all instances.

I played with the idea of creating a SAN with something like GlusterFS, then uploading directly to that and serving from it. I have not ruled it out, but from various sources I gather the reliability of this solution might be less than ideal (if anyone has better insight on this, I would love to hear it). In any case, I wanted to formulate a more "out of the box" (in the context of AWS) solution.

So, to elaborate on the diagram: I want the file to be uploaded to the local filesystem of whichever instance the request happens to hit, which is an EBS volume. The storage location of the file would not be served to the public (e.g. /tmp/uploads/), but it could still be accessed by the instance through a readfile() operation in PHP, so the user could see and manipulate it right after uploading. Once the user is finished manipulating the file, a message to move it to S3 could be queued in SQS.
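
To make this concrete, here is a rough sketch of the kind of upload handler I have in mind. The paths, queue URL, and AWS SDK for PHP usage are just placeholders, not something I've settled on:

```php
<?php
// Sketch of the upload handler on whatever instance the ELB routes the request to.
// Assumes the AWS SDK for PHP is installed via Composer; all names are placeholders.
require 'vendor/autoload.php';

use Aws\Sqs\SqsClient;

$uploadDir = '/tmp/uploads/';                      // EBS-backed, not publicly served
$fileName  = basename($_FILES['upload']['name']);
$localPath = $uploadDir . uniqid() . '-' . $fileName;

// 1. Store the upload on the instance's local (EBS) filesystem.
move_uploaded_file($_FILES['upload']['tmp_name'], $localPath);

// 2. The user can immediately view/manipulate it, e.g. streamed back with readfile($localPath).

// 3. When the user is done, queue a "move this file to S3" job in SQS.
$sqs = new SqsClient(['version' => 'latest', 'region' => 'us-east-1']);
$sqs->sendMessage([
    'QueueUrl'    => 'https://sqs.us-east-1.amazonaws.com/123456789012/move-to-s3', // placeholder
    'MessageBody' => json_encode(['path' => $localPath]),
]);
```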

My question, then: once I save the file "locally" on the instance (which could be any instance, due to the load balancer), how can I record which instance it is on (in the DB) so that subsequent PHP requests to read or move the file will find it?
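
The only part I can picture so far (and I'm not sure it's right) is grabbing the instance ID from the EC2 metadata service and storing it next to the file record; the table and column names below are made up:

```php
<?php
// Sketch: record which instance holds the local copy (schema and credentials are hypothetical).
$localPath  = '/tmp/uploads/abc123-photo.jpg';  // wherever the upload landed
$instanceId = file_get_contents('http://169.254.169.254/latest/meta-data/instance-id');

$db   = new PDO('mysql:host=mydb;dbname=app', 'user', 'pass');
$stmt = $db->prepare(
    'INSERT INTO uploads (local_path, instance_id, uploaded_at) VALUES (?, ?, NOW())'
);
$stmt->execute([$localPath, $instanceId]);
```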

If anyone with more experience in this has some insight I would be very grateful. Thanks.

Upvotes: 1

Views: 3530

Answers (1)

Stephen Harrison

Reputation: 699

I have a suggestion for a different design that might solve your problem.

Why not always write the file to S3 first? And then copy it to the local EBS file system on whichever node you're on while you're working on it (I'm not quite sure what manipulations you need to do, but I'm hoping it doesn't matter). When you're finished modifying the file, simply write it back to S3 and delete it from the local EBS volume.
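
A minimal sketch of that round trip, assuming the AWS SDK for PHP and made-up bucket, key, and path names:

```php
<?php
// Sketch of the S3-first flow: pull the object down, work on it locally, push it back, clean up.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3        = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);
$bucket    = 'my-app-uploads';              // placeholder
$key       = 'uploads/1234/photo.jpg';      // placeholder
$localPath = '/tmp/uploads/photo.jpg';

// 1. Copy the current version from S3 to the local EBS filesystem.
$s3->getObject(['Bucket' => $bucket, 'Key' => $key, 'SaveAs' => $localPath]);

// 2. ...manipulate $localPath however you need to...

// 3. Write the result back to S3 and delete the local copy.
$s3->putObject(['Bucket' => $bucket, 'Key' => $key, 'SourceFile' => $localPath]);
unlink($localPath);
```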

In this way, none of the nodes in your cluster need to know which of the others might have the file because the answer is it's always in S3. And by deleting the file locally, you get a fresh version of the file if another node updates it.

Another thing you might consider, if it's too expensive to copy the file from S3 every time (it's too big, or you don't like the latency): you could turn on session affinity in the load balancer (AWS calls this sticky sessions). This can be handled by your own cookie or by the ELB. Now subsequent requests from the same browser come to the same cluster node. Simply check the modified time of the file on the local EBS volume against the S3 copy, and replace the local copy if the S3 version is more recent. Then you get to take advantage of the local EBS filesystem while the file is being worked on.
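
With sticky sessions on, that freshness check could be as simple as comparing the local file's mtime to the S3 object's LastModified. Again just a sketch with placeholder names, assuming the AWS SDK for PHP:

```php
<?php
// Sketch: re-download from S3 only if the S3 copy is newer than the local EBS copy.
require 'vendor/autoload.php';

use Aws\S3\S3Client;

$s3        = new S3Client(['version' => 'latest', 'region' => 'us-east-1']);
$bucket    = 'my-app-uploads';              // placeholder
$key       = 'uploads/1234/photo.jpg';      // placeholder
$localPath = '/tmp/uploads/photo.jpg';

$head       = $s3->headObject(['Bucket' => $bucket, 'Key' => $key]);
$s3Modified = $head['LastModified']->getTimestamp();   // DateTime-like object from the SDK

// Replace the local copy only when S3 has a more recent version (or there is no local copy yet).
if (!file_exists($localPath) || filemtime($localPath) < $s3Modified) {
    $s3->getObject(['Bucket' => $bucket, 'Key' => $key, 'SaveAs' => $localPath]);
}
```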

Of course there are a bunch of things I don't get about your system. Apologies for that.

Upvotes: 4
