Reputation: 463
I'm working on a project that incorporates file storage and sharing features, and after months of researching the best way to leverage AWS I'm still a little unsure.
Basically, my decision is between using EBS storage to house user files or S3. The system will incorporate on-the-fly zip archiving when a user wants to download a handful of files. Also, when users download any files, I don't want the URLs to those files exposed.
The two best options I've come up with are:
1 - Have an EC2 instance which has a number of EBS volumes mounted to store user files.
2 - After files are uploaded and processed, the system pushes those files to an S3 bucket for long-term storage. When files are requested, I will retrieve the files from S3 and output back to the client.
Are any of my assumptions flawed? Is there a better way of managing massive amounts of file storage?
Upvotes: 28
Views: 17297
Reputation: 501
The main question is where you are hosting. Since you said you are using an EC2 instance, which means you are leveraging AWS, I would go for EBS, and then EFS if you need to scale.
S3 is great, but IMO it is mainly for when you are hosting your site with a different provider, like Namecheap, etc., and just want to use AWS for a database.
I don't think reliability and durability matter much here, especially when you consider that you can take snapshots of your EC2 and EFS storage to back it up.
I would go solely based on price. See which one is cheaper. If there is a significant performance difference (2-5 seconds USER wait time), I would maybe consider spending more for the faster one.
EFS is a method of scaling and might be cheaper than doing EBS. I believe Amazon recommends using EBS until it gets to a certain size and then switching to EFS.
Upvotes: 0
Reputation: 81454
Some considerations:
I would keep everything on S3: download the files as required, zip them into a package, then upload the zip back to S3 and deliver to the user an S3 signed URL to download from S3.
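A minimal sketch of that flow with boto3; the bucket name and key layout here are assumptions, not something from the question:

```python
import io
import zipfile

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")
BUCKET = "user-files-example"  # hypothetical bucket name

def zip_and_presign(keys, zip_key, expires=300):
    """Fetch the requested objects, zip them in memory, upload the
    archive back to S3, and return a time-limited download URL."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        for key in keys:
            body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            archive.writestr(key.rsplit("/", 1)[-1], body)
    buf.seek(0)
    s3.upload_fileobj(buf, BUCKET, zip_key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": zip_key},
        ExpiresIn=expires,  # URL stops working after this many seconds
    )
```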
You could allow the user to download from your EC2 instance, but lots of users have error problems, retry issues, slow bandwidth, etc. If the zip files are small (less than 100 MB), deliver them locally; otherwise upload to S3 and let S3 deal with the user download issues.
Another option would be to create a Lambda function that creates the zip file and stores it on S3. Now you don't have to worry about network bandwidth or scaling. The Lambda function could either return the S3 URL to you, which you deliver to the browser, or Lambda could email the customer a link; look into SES for this. Note: the Lambda file system only has 512 MB of space, and memory can be allocated up to 1.5 GB. If you are generating zip files larger than this, Lambda won't work (at this time). However, you could create multiple zip files (part1, part2, ...).
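A rough sketch of such a Lambda handler, assuming boto3 and a hypothetical event shape with "bucket" and "keys" fields; it zips into /tmp, so it is subject to the 512 MB limit mentioned above:

```python
import os
import zipfile

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    """Zip the S3 objects listed in the event into /tmp (512 MB limit),
    upload the archive, and return a presigned download URL."""
    bucket = event["bucket"]  # hypothetical event shape
    keys = event["keys"]
    zip_path = "/tmp/bundle.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for key in keys:
            local = os.path.join("/tmp", os.path.basename(key))
            s3.download_file(bucket, key, local)
            archive.write(local, arcname=os.path.basename(key))
            os.remove(local)  # free /tmp space as we go
    zip_key = event.get("zip_key", "zips/bundle.zip")
    s3.upload_file(zip_path, bucket, zip_key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": zip_key},
        ExpiresIn=300,
    )
```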
Upvotes: 1
Reputation: 5649
If you are insistent on serving the zip files directly from your EC2 instance, using S3 will just be more complicated than storing them locally. But S3 is much more durable than any EC2 storage volume, so I'd recommend using it anyway if the files need to be kept for a long time.
You say you don't want to expose the file URLs directly. If that's just because you don't want people to be able to bookmark them and bypass your service's authentication in the future, S3 has a great solution:
1 - Store the files you want to serve (zipped up if you want it that way) in a private S3 bucket.
2 - When a user requests a file, authenticate the request and then redirect valid requests to a signed, temporary S3 URL of the file. There are plenty of libraries in a variety of languages that can create those URLs.
3 - The user downloads the file directly from S3, without it having to pass through your EC2 instance. That saves you bandwidth and time, and probably gives the fastest download possible to the user.
This does expose a URL, but that's probably okay. There's no problem if the user saves the URL, because it will not work after the expiration time you set on it. For my service I set that time to 5 minutes. Since it is digitally signed, the user can't change the expiration time in the URL without invalidating the signature.
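For illustration, a minimal sketch of step 2 using Flask and boto3; the route, bucket name, and the user_is_authorized check are hypothetical, standing in for whatever auth your service already does:

```python
import boto3
from flask import Flask, abort, redirect

app = Flask(__name__)
s3 = boto3.client("s3")
BUCKET = "my-private-bucket"  # hypothetical private bucket

@app.route("/download/<path:key>")
def download(key):
    if not user_is_authorized(key):  # hypothetical: your own auth check
        abort(403)
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=300,  # 5 minutes, as suggested above
    )
    return redirect(url, code=302)  # browser downloads straight from S3
```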
Upvotes: 5
Reputation: 2216
If your service is going to be used by an undetermined number of users, it is important to bear in mind that scalability will always be a concern. Regardless of the option adopted, you will need to scale the service to meet demand, so it is convenient to assume that your service will be running in an Auto Scaling Group with a pool of EC2 instances, not a single instance.
Regarding protecting the URLs so that only authorized users can download the files: there are many ways to do this without requiring your service to act as an intermediary, but you will need to deal with at least two issues:
File name predictability: to avoid predictable URLs, you could name each uploaded file by a hash and store the original filenames and ownership in a database like SimpleDB. Optionally, you can set an HTTP header such as "Content-Disposition: attachment; filename=original_file_name.ext" to advise the user's browser to name the downloaded file accordingly.
Authorization: when the user asks your service to download a given file, issue a temporary authorization using Query String Authentication or Temporary Security Credentials for that specific user, granting read access to the file for a period of time, then redirect to the S3 bucket URL for direct download. This can greatly offload your pool of EC2 instances, making them available to process other requests more quickly.
To reduce the space and traffic to your S3 bucket (remember you pay per GB stored and transferred), I would also recommend compressing each individual file using a standard algorithm like gzip before uploading to S3, and setting the header "Content-Encoding: gzip" so that automatic decompression works in the user's browser; a short sketch follows. If your programming language of choice is Java, I suggest taking a look at the webcache-s3-maven-plugin that I created to upload static resources from web projects.
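A short sketch combining both ideas above (hashed object key, Content-Disposition, Content-Encoding) using boto3; the bucket name and function are assumptions for illustration:

```python
import gzip
import hashlib

import boto3

s3 = boto3.client("s3")
BUCKET = "user-files-example"  # hypothetical bucket name

def upload_compressed(data: bytes, original_name: str) -> str:
    """Store a gzip-compressed object under an unpredictable hashed key,
    keeping the original filename in Content-Disposition."""
    key = hashlib.sha256(data).hexdigest()  # unguessable object key
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=gzip.compress(data),
        ContentEncoding="gzip",  # browser decompresses transparently
        ContentDisposition=f'attachment; filename="{original_name}"',
    )
    return key  # map key -> original filename/ownership in your database
```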
Regarding the processing time for compressing a folder, you frequently won't be able to guarantee that a folder is compressed quickly enough for the user to download it immediately, since eventually there will be huge folders that take minutes or even hours to compress. For this I suggest using the SQS and SNS services to make the compression asynchronous: the frontend enqueues a compression request in SQS, a backend worker picks it up, compresses the folder, uploads the zip to S3, and publishes an SNS notification with the download link.
In this scenario you could have two Auto Scaling Groups, frontend and backend respectively, each with its own scaling constraints; a sketch of the queue hand-off is below.
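A rough sketch of that hand-off with boto3; the queue URL, topic ARN, and the compress_folder_and_upload helper are hypothetical placeholders:

```python
import json

import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/zip-jobs"  # hypothetical
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:zip-ready"               # hypothetical

def enqueue_zip_job(user_id, folder_key):
    """Frontend: queue a compression request instead of blocking the user."""
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user": user_id, "folder": folder_key}),
    )

def worker_loop():
    """Backend: poll for jobs, compress, then notify via SNS."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            url = compress_folder_and_upload(job["folder"])  # hypothetical helper
            sns.publish(
                TopicArn=TOPIC_ARN,
                Message=f"Your archive is ready: {url}",
            )
            sqs.delete_message(
                QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"]
            )
```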
Upvotes: 22
Reputation: 10033
Using S3 is the better option for this use case. It scales better and will be simpler. Why are you concerned about it being slow? Transfers between EC2 and S3 are pretty snappy.
Upvotes: 2