tim peterson

Reputation: 24305

Store each AWS S3 file in a database as a separate row?

I know a lot has been said on SO about how a file should be represented in a database, but I couldn't find any Stack Overflow questions that go deeper into how multiple related files should be stored.

I'm using Amazon S3 and have grouped images into albums (i.e., "folders") inside a single S3 bucket. I've read that it is a good practice to at least store the file path in one's database.

My question is what to do with multiple files that all share the same "folder" path. Here's my S3 structure:

my-bucket/folder1/img1a.jpg
my-bucket/folder1/img1b.jpg

my-bucket/folder2/img2a.jpg
my-bucket/folder2/img2b.jpg

Some questions:

  1. Should I represent this with 2 or 4 rows in my database?
  2. If each image is actually stored in S3 as multiple images of different sizes (40x40, 480x320), how might it be best to keep that information in my database and in my bucket?
  3. Looking at the AWS S3 SDK, I couldn't figure out how to get all the file URLs in a particular "folder". Am I missing something?

Upvotes: 3

Views: 6502

Answers (2)

Mike Brant

Reputation: 71384

First, from the earlier answer and conversation, I would say: don't worry about billions of rows until you have that problem to contend with. If you are just designing a brand new service, there is likely no need to worry about how you are going to manage billions of images right off the bat. A highly available, low-latency service that can serve billions of files is a design challenge that some of the best engineers in the world might take years to design and implement.

Perhaps focus a few orders of magnitude lower and think about how you are going to deal with millions or tens of millions of records, or whatever is a realistic number of objects you will need to manage in the next year or two. In that case, there really is no reason that, for example, a MySQL installation with well-designed indexes could not handle querying tables with millions of rows with good response times, particularly if you understand the access pattern and are able to cache frequently requested file metadata.

Whether a relational database is the best way to store your file metadata really depends on the hierarchy of the data you are going to store and what your access pattern is going to be (i.e. how you are going to look up the data). You gave a very rudimentary example of how your files are going to be organized and suggested that there may be some organizational structure where each image is stored at multiple resolutions.

Does your application need to understand what all the resolution options are for an image and decide the best one to serve up based on some criteria, or will you always know the exact image that you are going to retrieve?

In the first case, you might want a NoSQL type storage for your metadata so that you can look up the image group and use application logic to select the best image file from the group. In the latter case, you might be better served to use a relational database or even a highly available key value store like SimpleDB or similar to get at the file metadata.
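To make the first case concrete, here is a minimal sketch of application logic that picks one rendition from an image group. The `Variant` type, the `pick_variant` helper, and the size-suffixed key names are all assumptions made for illustration; the question does not say how the resized files are named.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    key: str     # S3 object key of this rendition (hypothetical naming scheme)
    width: int
    height: int


def pick_variant(variants: List[Variant], want_w: int, want_h: int) -> Optional[Variant]:
    """Return the smallest variant that covers the requested size,
    falling back to the largest one when none is big enough."""
    if not variants:
        return None
    by_area = sorted(variants, key=lambda v: v.width * v.height)
    for v in by_area:
        if v.width >= want_w and v.height >= want_h:
            return v
    return by_area[-1]


# One image group, stored at the two sizes mentioned in the question.
album_image = [
    Variant("folder1/img1a_40x40.jpg", 40, 40),
    Variant("folder1/img1a_480x320.jpg", 480, 320),
]

thumbnail = pick_variant(album_image, 30, 30)
```

If the application always knows the exact file it wants, none of this is needed and a plain lookup by key or row id is enough.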

Also, with regard to actually serving up the images, you might want to consider using CloudFront to serve your S3 files, as that will give you some latency advantages as well.

With regard to your question about "folders" in S3, it is important to understand that there are not really folders in S3. People commonly name their files with folder-like naming schemes to suggest some hierarchical grouping within the bucket, but there is no physical directory structure. All files exist at the bucket level only, and "folder1/img1a.jpg" is simply one key. The nearest equivalent to listing a directory is to list the bucket's keys filtered by a prefix such as "folder1/", which the S3 list-objects operation supports.
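Because the grouping lives entirely in the object name, picking out one "folder" comes down to matching keys on their leading characters. A minimal sketch, using an in-memory list as a stand-in for the bucket (no AWS calls are made, and `list_by_prefix` is a hypothetical helper):

```python
def list_by_prefix(keys, prefix):
    """Return the keys that start with the given prefix, sorted."""
    return sorted(k for k in keys if k.startswith(prefix))


# The four object keys from the question, as they exist at bucket level.
bucket_keys = [
    "folder1/img1a.jpg",
    "folder1/img1b.jpg",
    "folder2/img2a.jpg",
    "folder2/img2b.jpg",
]

folder1 = list_by_prefix(bucket_keys, "folder1/")
```

Keeping the trailing slash in the prefix matters: "folder1" alone would also match a key such as "folder10/img.jpg".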

Here's a files table (if using SQL or variant):

file_id  folder_id     file_path
  1          1       http://s3.aws.amazon.com/my-bucket/folder1/img1a.jpg
  2          1       http://s3.aws.amazon.com/my-bucket/folder1/img1b.jpg
  3          2       http://s3.aws.amazon.com/my-bucket/folder2/img2a.jpg
  4          2       http://s3.aws.amazon.com/my-bucket/folder2/img2b.jpg

Here, file_id would be an auto-incrementing primary key and folder_id would be an indexed int column, which provides an easy way to look up all files in a certain folder.
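The table above could be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the index name `idx_files_folder_id` and the helper `files_in_folder` are assumptions for illustration, and the column types would need adjusting for a real MySQL schema.

```python
import sqlite3

# In-memory SQLite database standing in for the real metadata store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE files (
    file_id   INTEGER PRIMARY KEY AUTOINCREMENT,
    folder_id INTEGER NOT NULL,
    file_path TEXT    NOT NULL
);
CREATE INDEX idx_files_folder_id ON files (folder_id);
""")

# One row per image: four rows for the four files in the question.
rows = [
    (1, "http://s3.aws.amazon.com/my-bucket/folder1/img1a.jpg"),
    (1, "http://s3.aws.amazon.com/my-bucket/folder1/img1b.jpg"),
    (2, "http://s3.aws.amazon.com/my-bucket/folder2/img2a.jpg"),
    (2, "http://s3.aws.amazon.com/my-bucket/folder2/img2b.jpg"),
]
conn.executemany("INSERT INTO files (folder_id, file_path) VALUES (?, ?)", rows)
conn.commit()


def files_in_folder(conn, folder_id):
    """Return the paths of all files in one folder, in insertion order."""
    cur = conn.execute(
        "SELECT file_path FROM files WHERE folder_id = ? ORDER BY file_id",
        (folder_id,),
    )
    return [row[0] for row in cur.fetchall()]


folder1_paths = files_in_folder(conn, 1)
```

With the index on folder_id, the lookup for a whole album does not need to scan the full table.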

Upvotes: 3

jcolebrand

Reputation: 16035

From what you're asking, it looks like you should have a "filepaths" table that has two items: a file id, and a filepath.

Then you have 4 rows in your database for the paths, and 1 row for the file itself, the metadata you're tracking.


You're conflating questions about Amazon services and database design. To that end, when it comes to:

If each image is actually stored in S3 as multiple images of different sizes (40x40, 480x320), how might it be best to keep that information in my database and in my bucket?

Looking at the AWS S3 SDK, I couldn't figure out how to get all the files in a particular "folder". Am I missing something?

I don't know anything about programming for Amazon Web Services. I can say that you probably can't get them all in a specific folder, as they probably shard internally, specifically to avoid the issues that you get by duplicating one record in your database up to four times.

As for how to store that information in your db and your bucket, I can only say "match your business needs".

Upvotes: 1
