Buffalo

Reputation: 4042

Storing many small files (on S3)?

I have 2 million zipped HTML files (100-150 KB each) being added each day that I need to store for a long time. The hot data (the most recent 70-150 million files) is accessed semi-regularly; anything older than that is barely ever accessed.

This means each day I'm storing an additional 200-300GB worth of files.

Now, Standard storage costs $0.023 per GB and $0.004 for Glacier.

While Glacier is cheap, the problem with it is the additional per-request costs, so it would be a bad idea to dump 2 million individual files into Glacier (at $0.05 per 1,000 requests, 2 million objects come to about $100 per day just to get them in):

PUT requests to Glacier $0.05 per 1,000 requests

Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests

Is there a way of gluing the files together, but keeping them accessible individually?

Upvotes: 9

Views: 3308

Answers (3)

Derrops

Reputation: 8127

Your files are just too small. You will need to combine them, probably in an ETL pipeline such as AWS Glue. You can also use the Range header, e.g. Range: bytes=1000-2000, to download part of an object from S3.

If you do that you'll need to figure out the best way to track the byte ranges, for example recording the range for each file after combining them, and changing the clients to request those ranges as well.
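To illustrate, here is a minimal sketch of such a ranged read with boto3; the bucket, key and byte offsets are invented and would come from whatever index you record while combining the files:

```python
# Minimal sketch of a ranged read with boto3. The bucket, key and byte
# offsets are made up for illustration; in practice they would come from
# the index you keep while combining the files.
import boto3

s3 = boto3.client("s3")

def fetch_range(bucket, key, start, end):
    """Download only bytes start..end (inclusive) of an S3 object."""
    resp = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={start}-{end}",
    )
    return resp["Body"].read()

# e.g. the index says TinyFileA occupies bytes 1000-2000 of a combined object
tiny_file_a = fetch_range("my-combined-bucket", "combined/2020-01-01.bin", 1000, 2000)
```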

The right approach, though, depends on how this data is accessed and on figuring out the patterns. If somebody who looks at TinyFileA also looks at TinyFileB, you could combine them and send them both along with other files they are likely to use. I would be figuring out logical groupings of files which make sense to consumers and will reduce the number of requests they need, without sending too much irrelevant data.

Upvotes: 2

user13025316

Reputation:

Glacier pricing is extremely sensitive to the number of objects. The best method would be to create a Lambda function that handles the zip/unzip operations for you.

Consider this approach:

  • A Lambda function creates archive_date_hour.zip of that day's 2 million files, one archive per hour; this solves the "per object" cost problem by producing 24 giant archive files instead (see the sketch after this list).
  • Set a lifecycle policy on the S3 bucket to transition objects older than 1 day to Glacier.
  • Use an unzipping Lambda function to fetch and extract potentially hot items from within the zip files in the Glacier bucket.
  • Keep the main S3 bucket for hot files with frequent access, as a working directory for the zip/unzip operations, and for collecting the new files each day.
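As a rough, hedged sketch of that hourly archiving Lambda: the bucket names, the incoming/date/hour key layout and the event shape are all assumptions, and at this scale (roughly 80,000+ files per hour) you would likely stream to ephemeral storage rather than build the archive in memory.

```python
# Illustrative only: zip one hour's worth of uploaded files into a single
# archive and write it to the bucket that has the Glacier lifecycle rule.
# Bucket names and key layout are invented for this sketch.
import io
import zipfile

import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "hot-files-bucket"   # assumed working bucket for new files
ARCHIVE_BUCKET = "archive-bucket"    # assumed bucket with the Glacier lifecycle rule

def handler(event, context):
    # Assumed event shape: {"date": "2020-01-01", "hour": "13"}
    prefix = f"incoming/{event['date']}/{event['hour']}/"
    buf = io.BytesIO()

    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as archive:
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
            for obj in page.get("Contents", []):
                body = s3.get_object(Bucket=SOURCE_BUCKET, Key=obj["Key"])["Body"].read()
                # The files are already gzipped HTML, so store them without recompressing.
                archive.writestr(obj["Key"].rsplit("/", 1)[-1], body)

    s3.put_object(
        Bucket=ARCHIVE_BUCKET,
        Key=f"archive_{event['date']}_{event['hour']}.zip",
        Body=buf.getvalue(),
    )
```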

Upvotes: 3

wowkin2

Reputation: 6355

An important point: if you need to provide quick access to these files, note that Glacier can take up to 12 hours to give you access to a file. So the best you can do is to use S3 Standard-Infrequent Access ($0.0125 per GB, with millisecond access) instead of S3 Standard, and maybe Glacier only for data that is really never used. But it still depends on how fast you need that data.
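If you set that up with a lifecycle rule, a hedged sketch with boto3 might look like the following; the bucket name, prefix and day thresholds are placeholders, and S3 only allows the Standard-IA transition for objects at least 30 days old.

```python
# Sketch of a lifecycle configuration: Standard -> Standard-IA -> Glacier.
# Bucket name, prefix and thresholds are placeholders, not recommendations.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-html-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-html",
                "Status": "Enabled",
                "Filter": {"Prefix": "archives/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```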

Given that, I'd suggest the following:

  • as HTML (text) files compress well, you can compress historical data into big zip files (daily, weekly or monthly), since together they can achieve even better compression;
  • make an index file or database so you know in which archive each HTML file is stored;
  • read only the desired HTML files from the archives without unpacking the whole zip file (see the Python example below for how to implement that).
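A minimal example of that last point in Python, with an invented index and file names:

```python
# Read a single member out of a big zip archive without extracting the rest.
# The index mapping and file names are invented for this example.
import zipfile

# index built when the archives were created: html file name -> archive path
index = {"page_12345.html": "archives/2020-01-01.zip"}

def read_html(name):
    with zipfile.ZipFile(index[name]) as archive:
        # ZipFile.open reads and decompresses only this one member
        with archive.open(name) as member:
            return member.read()

html_bytes = read_html("page_12345.html")
```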

Upvotes: 5
