Reputation:
I have an Airflow job that calls an API endpoint every 5 minutes (24 x 12 = 288 calls per day). The API response is a JSON document with six items in it (~1 KB). I am storing each response as a separate file in Amazon S3.
Current S3 organization:
s3://bucket/data/
1/1/2021/
--- 288 .json files (one file every 5 minutes)
1/2/2021/
--- 288 .json files
With this approach, a lot of small files accumulate in S3. Is there a better approach I can implement to handle this issue of small files?
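A simplified sketch of the task that writes each response (the endpoint URL and bucket name below are placeholders, not the real values):

```python
# Current per-run write: one small JSON object lands in S3 every 5 minutes.
# Endpoint URL and bucket name are placeholders.
import json
from datetime import datetime, timezone

import boto3
import requests

def fetch_and_store():
    resp = requests.get("https://example.com/api/endpoint", timeout=30)
    now = datetime.now(timezone.utc)
    key = f"data/{now.month}/{now.day}/{now.year}/{now:%H%M%S}.json"
    boto3.client("s3").put_object(
        Bucket="bucket",
        Key=key,
        Body=json.dumps(resp.json()).encode("utf-8"),
    )
```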
Upvotes: 1
Views: 573
Reputation: 269470
One option is to send each response to an Amazon Kinesis Data Firehose delivery stream instead of writing a file yourself. Firehose buffers incoming records by size or time and writes them to S3 in batches, such as one combined file every 5 minutes or every 5 MB, whichever comes first.
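For illustration, a minimal sketch of the producer side, assuming a delivery stream (hypothetically named "api-responses") already exists with an S3 destination and buffering hints of 5 minutes / 5 MB configured on it:

```python
# Producer side: instead of writing one S3 object per call, push each API
# response into Firehose and let it batch records into larger S3 objects.
import json

import boto3

firehose = boto3.client("firehose")

def deliver(payload: dict) -> None:
    firehose.put_record(
        DeliveryStreamName="api-responses",  # hypothetical stream name
        # Newline-delimited JSON so the batched S3 objects stay line-parseable.
        Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
    )

deliver({"example": "one 5-minute API response"})
```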
Another option is to run a daily job (or more often) that combines the data from those small files into a single file. Depending upon the data format, this could be done with Amazon Athena, and combining the files is also an opportunity to convert them to a format better suited to how you intend to use the data. The best format for later querying is Snappy-compressed Parquet, which Amazon Athena can query quickly and cheaply.
Atlassian does the latter -- they have a job that combines all files received during the day into daily batched files. See: Socrates: Atlassian's Data Lake
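A minimal sketch of such a combining job, assuming boto3, pandas and pyarrow are available and the bucket/prefix layout matches the question; the output key is hypothetical:

```python
# Daily batch job: read all 5-minute JSON files for one day and rewrite them
# as a single Snappy-compressed Parquet object.
import io
import json
from datetime import date, timedelta

import boto3
import pandas as pd

BUCKET = "bucket"  # placeholder bucket name
s3 = boto3.client("s3")

def combine_day(day: date) -> None:
    prefix = f"data/{day.month}/{day.day}/{day.year}/"
    records = []

    # Paginate in case a prefix ever holds more than 1,000 objects.
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            records.append(json.loads(body))

    if not records:
        return

    # Snappy is the default Parquet compression for pandas/pyarrow.
    buf = io.BytesIO()
    pd.json_normalize(records).to_parquet(buf, compression="snappy")
    s3.put_object(
        Bucket=BUCKET,
        Key=f"combined/{day.isoformat()}.snappy.parquet",  # hypothetical output key
        Body=buf.getvalue(),
    )

# Combine yesterday's 288 files into one Parquet object.
combine_day(date.today() - timedelta(days=1))
```

Once the combined Parquet objects are in place, pointing an Athena (Glue) table at that prefix makes them directly queryable.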
Upvotes: 1