Reputation: 6002
I am working on a desktop app that offers uploading to cloud storage. Storage providers have an easy way to upload files: you get an accessKeyId and secretAccessKey and you are ready to upload. I am trying to come up with an optimal way to upload files.
Option 1. Pack each app instance with the access keys. This way files can be uploaded directly to the cloud without a middle man. Unfortunately, I cannot execute any logic before the upload reaches the cloud. For example, if each user has 5 GB of storage available, I cannot enforce this constraint at the storage provider; I haven't found any provider that does that. I could send a request to my own server before the upload to do the verification, but since the keys are hardcoded in the app, I am sure this is an easy exploit.
Option 2. Send each uploaded file to a server, where the constraint logic can be executed, and forward the file to the final cloud storage. This approach suffers from a bottleneck at the server. For example, if 100 users start uploading (or downloading) a 1 GB file and the server has a bandwidth of 1000 Mb/s, then each user uploads at only 10 Mb/s = 1.25 MB/s.
Option 2 seems to be the way to go, because I get control over who can upload and the keys aren't shared publicly. I am looking for tips to minimise the bandwidth bottleneck. What approach is recommended for handling simultaneous uploads of large files to cloud storage? I am thinking of deploying many low-CPU, low-memory instances and using streaming instead of buffering the whole file first and sending it afterwards.
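To make the streaming idea concrete, here is roughly what I mean by relaying the upload as a stream instead of buffering it; I'm assuming Flask and boto3 and a made-up bucket name purely for illustration:

```python
import boto3
from flask import Flask, abort, request

app = Flask(__name__)
s3 = boto3.client("s3")  # keys live only on the server, never in the desktop app

@app.route("/upload/<user_id>/<filename>", methods=["PUT"])
def upload(user_id, filename):
    # Quota/constraint logic would run here, before any byte reaches storage.
    if request.content_length is None:
        abort(411)  # need the size up front to check the 5 GB limit
    # request.stream is a file-like object; upload_fileobj reads it in chunks
    # (multipart upload), so the whole file is never buffered on this server.
    s3.upload_fileobj(request.stream, "user-uploads", f"{user_id}/{filename}")
    return {"status": "uploaded"}
```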
Upvotes: 0
Views: 1657
Reputation: 1145
I believe asking for architecture validation and improvement is out of scope for this forum, but I'll bite. Also, some aspects are not clear. I assume you mean you'll upload files to something like S3, but you'll limit how much users can upload based on how much they are paying.
You can go with Option 1: upload directly to the storage provider, but validate with your server first. You'll need to be able to do a few things, described below.
These will increase your cost, though not as much as Option 2 would.
Your app will make an API call to your server before uploading in order to determine whether the upload is valid. Any answer (or lack of one) that is not a clear go-ahead means the upload fails. That also means you're introducing a single point of failure into your architecture, and you'd better make sure your server is always up and available as long as you still have users; otherwise you'll be in breach of Wheaton’s Law. My advice: go serverless here.
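As a sketch of what that validation call could look like on the server side (here as an AWS Lambda handler behind API Gateway, with a placeholder quota lookup you'd back with your own database):

```python
import json

QUOTA_BYTES = 5 * 1024**3  # the 5 GB per-user limit from the question

def get_used_bytes(user_uuid):
    """Placeholder for the real lookup (DynamoDB, SQL, whatever you use)."""
    return 0

def lambda_handler(event, context):
    body = json.loads(event["body"])
    user_uuid = body["user"]
    upload_size = int(body["size"])

    if get_used_bytes(user_uuid) + upload_size > QUOTA_BYTES:
        # Not a clear "yes" -> the app must treat the upload as rejected.
        return {"statusCode": 403, "body": json.dumps({"allowed": False})}

    # On success, the real handler would also return the temporary
    # key/secret pair described in the next point.
    return {"statusCode": 200, "body": json.dumps({"allowed": True})}
```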
You will use temporary access_key/secret_key pairs to upload the files. The desktop app will upload the file directly to whatever provider you're dealing with, but it will use a key/secret pair that changes every, say, 12 hours. Each user gets their own pair and you need to make sure that a user only has access to their own files. Otherwise they'll be able to access everyone's files and you'll be breaking Wheaton’s Law. This way, even if they somehow figure out what the secret is they will only have access for 12 hours at most, after which you will change the keys and cut them off.
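On AWS, for instance, STS can mint exactly this kind of pair: your server keeps the long-term keys and hands the app short-lived credentials scoped to that user's prefix. A sketch, with the bucket name, layout, and 12-hour window as assumptions:

```python
import json
import boto3

sts = boto3.client("sts")  # called with your long-term, server-side credentials

def issue_temporary_credentials(user_uuid):
    # The inline policy limits the pair to the user's own "directory" only.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:PutObject", "s3:GetObject"],
            "Resource": f"arn:aws:s3:::user-uploads/{user_uuid}/*",
        }],
    }
    resp = sts.get_federation_token(
        Name=user_uuid[:32],        # federation name is capped at 32 characters
        Policy=json.dumps(policy),
        DurationSeconds=12 * 3600,  # the 12-hour rotation window
    )
    creds = resp["Credentials"]
    return {
        "access_key_id": creds["AccessKeyId"],
        "secret_access_key": creds["SecretAccessKey"],
        "session_token": creds["SessionToken"],
        "expires": creds["Expiration"].isoformat(),
    }
```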
All communication between the app and your server is encrypted using public-key cryptography. The private key is stored on your server; the user gets the public key. That way you can easily update the encryption keys if needed, because the public key is public. Remember, this provides encryption, not authentication.
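For illustration, a library such as PyNaCl (my pick here, not a requirement) does this with a sealed box: anyone holding the public key can encrypt, but only the server's private key can decrypt:

```python
from nacl.public import PrivateKey, SealedBox

# Server side, once: generate the pair; the public key ships with the desktop app.
server_private = PrivateKey.generate()
server_public = server_private.public_key

# App side: encrypt the request payload with the public key.
payload = b'{"user": "some-user-uuid", "size": 1073741824}'
ciphertext = SealedBox(server_public).encrypt(payload)

# Server side: only the private key can open it.
assert SealedBox(server_private).decrypt(ciphertext) == payload
# Anyone with the (public) key can produce such a message, which is why
# this gives you encryption, not authentication.
```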
You can easily invalidate a user's access by changing their access_key/secret_key pair(s) used to communicate directly with the storage provider(s) and the key pair used to communicate with your server.
Your server should keep track of each user's files and validate that what is in your server-side database is the same as what's on storage. Do it regularly: daily, weekly, every 2 hours, whatever works for you. If you find inconsistencies, investigate. Maybe they are trying to cheat, or maybe your app has a bug. That means you have to be able to identify at the storage level which file belongs to which user. This can be as easy as storing all of a user's files in a directory named with their UUID. Do not use names or emails there. No personally identifiable data should be stored anywhere except in your database, and even there only if needed, and encrypted.
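A sketch of that reconciliation job, again assuming S3-style listing under a per-user prefix and a made-up bucket name; the expected value comes from your database:

```python
import boto3

s3 = boto3.client("s3")

def reconcile(user_uuid, expected_bytes):
    """Compare the bytes actually stored under the user's prefix with the database."""
    stored_bytes = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="user-uploads", Prefix=f"{user_uuid}/"):
        for obj in page.get("Contents", []):
            stored_bytes += obj["Size"]

    if stored_bytes != expected_bytes:
        # Cheating client or app bug; either way, worth a look.
        print(f"user {user_uuid}: storage={stored_bytes}, db={expected_bytes} -> investigate")
```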
So, it goes something like this: the app asks your server whether the upload is allowed; the server checks the user's quota and, if everything is fine, returns a temporary key/secret pair; the app uploads the file directly to the storage provider with those credentials; and your server records the new file in its database.
Download and other operations should be handled in a similar manner.
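Put together, the desktop-app side could look roughly like this (the endpoint URL and response shape are made up here and should match whatever your server actually returns):

```python
import os
import boto3
import requests

def upload_file(user_uuid, path):
    # 1. Ask the server first; anything other than a clean answer means no upload.
    resp = requests.post(
        "https://api.example.com/uploads",  # hypothetical validation endpoint
        json={"user": user_uuid, "size": os.path.getsize(path)},
        timeout=10,
    )
    resp.raise_for_status()
    creds = resp.json()

    # 2. Upload straight to the storage provider with the short-lived credentials.
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["access_key_id"],
        aws_secret_access_key=creds["secret_access_key"],
        aws_session_token=creds["session_token"],
    )
    s3.upload_file(path, "user-uploads", f"{user_uuid}/{os.path.basename(path)}")
```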
This is a great use case for serverless (Lambda on AWS, Google Cloud Functions, etc.), which should reduce the costs and provide increased redundancy and "uptime".
Improvements can be made and there are pitfalls. Encrypting files client side before upload would add an extra layer of security, for example. But this post is too long already.
There you go. That'll be $3000 :).
Upvotes: 1