user16344431

What's the best way to organize small files in Amazon S3?

I have an Airflow job that calls an API endpoint every 5 minutes (24 x 12 = 288 calls per day). The API response is a JSON document with six items in it (~1 KB). I am storing each response as a separate file in Amazon S3.

Current S3 organization:

s3://bucket/data/
    1/1/2021/
        -- 288 .json files (one file every 5 minutes)
    1/2/2021/
        -- 288 .json files

With this approach, a lot of small files accumulate in S3. Is there a better approach I can implement to handle this small-files problem?
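
For context, the write step in the Airflow task is roughly the following (a minimal sketch; the endpoint URL and bucket name here are placeholders, not my real values):

    import json
    from datetime import datetime, timezone

    import boto3
    import requests

    def fetch_and_store():
        # Placeholder endpoint; the real API is not shown here
        payload = requests.get("https://api.example.com/metrics").json()

        now = datetime.now(timezone.utc)
        # One object per 5-minute run: data/M/D/YYYY/HHMM.json
        key = f"data/{now.month}/{now.day}/{now.year}/{now:%H%M}.json"

        boto3.client("s3").put_object(
            Bucket="bucket",  # placeholder bucket name
            Key=key,
            Body=json.dumps(payload).encode("utf-8"),
        )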

Upvotes: 1

Views: 573

Answers (1)

John Rotenstein

Reputation: 269470

One option is to send the data to an Amazon Kinesis Data Firehose stream instead of storing a file yourself. The Firehose stream can batch data by size or time, such as saving the data into a file every 5 minutes or every 5 MB.
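
A minimal sketch of that approach, assuming boto3 and a pre-created delivery stream (the stream name here is hypothetical):

    import json

    import boto3

    firehose = boto3.client("firehose")

    def send_to_firehose(payload: dict) -> None:
        # Firehose buffers incoming records and writes them to S3 as a
        # single object once the configured size or time threshold is hit.
        firehose.put_record(
            DeliveryStreamName="json-batcher",  # hypothetical stream name
            Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
        )

The trailing newline keeps the batched file as one JSON document per line, which Athena's JSON SerDe can read directly.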

Another option is to run a daily job (or one that runs more often) that combines the data from those files into a single file. Depending upon the data format, this could be done with Amazon Athena. Depending upon how you want to use the saved data, it also provides the opportunity to change the data format when combining the files. The best format for later querying would be Snappy-compressed Parquet, which Amazon Athena can query quickly and cheaply.
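
For example, the daily combine could be an Athena CTAS query launched from the same Airflow environment. This is a sketch, assuming the raw JSON files are already registered as a Glue/Athena table partitioned by year/month/day (the table, database, and bucket names are assumptions):

    import boto3

    athena = boto3.client("athena")

    # CTAS query that rewrites one day of small JSON files as a single
    # Snappy-compressed Parquet dataset under a new S3 prefix.
    query = """
    CREATE TABLE combined_2021_01_01
    WITH (
        format = 'PARQUET',
        parquet_compression = 'SNAPPY',
        external_location = 's3://bucket/combined/1/1/2021/'
    ) AS
    SELECT * FROM raw_json_table
    WHERE year = '2021' AND month = '1' AND day = '1'
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "my_database"},  # hypothetical
        ResultConfiguration={"OutputLocation": "s3://bucket/athena-results/"},
    )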

Atlassian does the latter -- they have a job that combines all files received during the day into daily batched files. See: Socrates: Atlassian's Data Lake

Upvotes: 1
