Reputation: 41945
I have an AWS Kinesis Firehose stream putting data in s3 with the following config:
S3 buffer size (MB)* 2
S3 buffer interval (sec)* 60
Everything works fine. The only problem is that Firehose creates one S3 file for every chunk of data (in my case, one file every minute, as in the screenshot). Over time this adds up to a lot of files: 1,440 files per day, about 525k files per year.
This is hard to manage (for example, if I want to copy the bucket to another one I would have to copy every single file one by one, which takes time).
Two questions:
1. Is there a way to tell Kinesis to group/concatenate old files together (for example, group files older than 24 hours into one file per day)?
2. How is Redshift COPY performance affected when COPYing from a plethora of S3 files versus just a few? I haven't measured this precisely, but in my experience performance with lots of small files is noticeably worse. From what I can recall, with big files a COPY of about 2M rows takes roughly a minute; with the same 2M rows spread over lots of small files (~11k), it takes up to 30 minutes.
My two main concerns are managing the ever-growing number of S3 files and the Redshift COPY performance.
Upvotes: 15
Views: 10352
Reputation: 908
I really like this solution by @psychorama. In fact, I could do the same in my project, where I was about to give up on the Firehose approach. Since I am reading data from DynamoDB and putting it into Kinesis Firehose, I can club the whole batch of DynamoDB data into one record (within the size limit) and then send it to Firehose. I'm not sure this would be easy to implement, though. Maybe in the second version.
Upvotes: 0
Reputation: 323
I faced a similar problem where there were too many files to handle. Here's a solution that can be useful:
i) Increase the buffer size to the maximum (128 MB).
ii) Increase the buffer interval to the maximum (900 seconds).
iii) Instead of publishing one record at a time, club multiple records into one (separated by newlines) to form a single Kinesis Firehose record (the max size of a Firehose record is 1,000 KB).
iv) Also, club multiple Kinesis Firehose records into a batch and then do a batch put (see PutRecordBatch: http://docs.aws.amazon.com/firehose/latest/APIReference/API_PutRecordBatch.html).
Each published S3 object will then contain as many of these batched records as the Firehose buffer can hold.
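For illustration, here is a minimal Python (boto3) sketch of steps iii) and iv): packing small events into newline-delimited Firehose records and sending them with PutRecordBatch. The stream name and the exact size limits are assumptions, so adjust them for your setup.

```python
# A minimal sketch of steps iii) and iv), assuming a hypothetical stream name.
import json
import boto3

STREAM_NAME = "my-delivery-stream"       # assumption: replace with your stream
MAX_RECORD_BYTES = 1000 * 1024           # ~1,000 KB limit per Firehose record
MAX_BATCH_RECORDS = 500                  # PutRecordBatch limit per call

firehose = boto3.client("firehose")

def club_events(events):
    """Pack many small events into as few newline-delimited records as possible."""
    records, current = [], b""
    for event in events:
        line = (json.dumps(event) + "\n").encode("utf-8")
        if current and len(current) + len(line) > MAX_RECORD_BYTES:
            records.append({"Data": current})
            current = b""
        current += line
    if current:
        records.append({"Data": current})
    return records

def send(events):
    records = club_events(events)
    # Send in chunks of up to 500 records per PutRecordBatch call.
    for i in range(0, len(records), MAX_BATCH_RECORDS):
        response = firehose.put_record_batch(
            DeliveryStreamName=STREAM_NAME,
            Records=records[i:i + MAX_BATCH_RECORDS],
        )
        if response["FailedPutCount"]:
            # Real code should retry the individual failed records here.
            print("failed records:", response["FailedPutCount"])
```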
Hope this helps.
Upvotes: 2
Reputation: 12929
Kinesis Firehose is designed to allow near-real-time processing of events. It is optimized for such use cases, which is why the settings favor smaller, more frequent files. This way you get the data into Redshift for querying sooner, or more frequent invocations of Lambda functions on the smaller files.
It is very common for customers of the service to also prepare the data for longer historical queries. Even though it is possible to run these long-term queries on Redshift, it may make sense to use EMR for them instead. You can then keep your Redshift cluster tuned for the more popular recent events (for example, a "hot" cluster holding 3 months on SSD and a "cold" cluster holding 1 year on HDD).
It makes sense to take the smaller (uncompressed?) files from the Firehose output S3 bucket and transfer them into a format better suited to EMR (Hadoop/Spark/Presto). You can use tools such as S3DistCp, or a similar job, to take the smaller files, concatenate them, and convert them to a format such as Parquet.
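If you go the EMR route, here is a hedged sketch of what the concatenation step might look like, submitted as an EMR step running S3DistCp via boto3. The cluster ID, bucket names, and --groupBy pattern are hypothetical; S3DistCp handles the grouping/concatenation, while a conversion to Parquet would be a separate Spark/Hive job.

```python
# A rough sketch: submit an S3DistCp step to an existing EMR cluster with boto3.
# The cluster id, bucket names, and --groupBy pattern below are hypothetical.
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # assumption: your EMR cluster id
    Steps=[{
        "Name": "Concatenate Firehose output",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "s3-dist-cp",
                "--src=s3://my-firehose-bucket/2017/01/",
                "--dest=s3://my-archive-bucket/2017/01/",
                # Files whose keys match the capture group are merged together,
                # so this groups everything from the same day into larger files.
                "--groupBy=.*(2017/01/\\d{2}).*",
                "--targetSize=128",   # target output size in MiB
                "--outputCodec=gz",
            ],
        },
    }],
)
```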
Regarding optimization of the Redshift COPY, there is a balance between the time you spend aggregating events and the time it takes to COPY them. It is true that larger files are better when you copy to Redshift, as there is a small overhead per file. On the other hand, if you COPY the data only every 15 minutes, you may have "quiet" periods in which you are not utilizing the network or the cluster's ability to ingest events between COPY commands. You should find the balance that works for the business (how fresh you need your events to be) and for the technical side (how many events you can ingest per hour/day into your Redshift).
Upvotes: 2
Reputation: 46879
The easiest fix is going to be to increase the Firehose buffer size and time limit; you can go up to 15 minutes, which will cut your 1,440 files per day down to 96 files per day (unless you hit the file-size limit first, of course).
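For reference, a small boto3 sketch of raising the buffering hints to those maximums through the API rather than the console. It assumes a hypothetical stream name and that the stream uses the Extended S3 destination.

```python
# A small sketch of raising the buffering hints to the maximums via the API,
# assuming a hypothetical stream name and an Extended S3 destination.
import boto3

firehose = boto3.client("firehose")
STREAM_NAME = "my-delivery-stream"  # assumption

desc = firehose.describe_delivery_stream(DeliveryStreamName=STREAM_NAME)
stream = desc["DeliveryStreamDescription"]
destination = stream["Destinations"][0]

firehose.update_destination(
    DeliveryStreamName=STREAM_NAME,
    CurrentDeliveryStreamVersionId=stream["VersionId"],
    DestinationId=destination["DestinationId"],
    ExtendedS3DestinationUpdate={
        # 128 MB / 900 seconds are the S3 destination maximums.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 900}
    },
)
```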
Beyond that, there is nothing in Kinesis that will concatenate the files for you, but you could set up an S3 event notification that fires each time Firehose creates a new file, and add some of your own code (maybe running on EC2, or serverless with Lambda) to do the concatenation yourself.
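As a rough illustration of the Lambda route (not this answer's own code), here is a sketch that reacts to ObjectCreated notifications on the Firehose bucket and appends each new object onto a per-day rollup object. The bucket names and the default YYYY/MM/DD/HH key layout are assumptions, and the naive read-modify-write is not safe under concurrent invocations; it is only meant to show the shape of the approach.

```python
# A naive Lambda sketch for the concatenation idea: triggered by S3
# ObjectCreated notifications on the Firehose bucket, it appends each new
# object onto a per-day rollup object. Bucket names and the default
# YYYY/MM/DD/HH key layout are assumptions; the read-modify-write below is
# not safe under concurrent invocations, so treat it as an outline only.
from urllib.parse import unquote_plus
import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "my-archive-bucket"  # assumption

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        # Firehose keys look like <prefix>/YYYY/MM/DD/HH/<file>; keep the date part.
        year, month, day = key.split("/")[-5:-2]
        rollup_key = f"daily/{year}-{month}-{day}.log"

        try:
            existing = s3.get_object(Bucket=DEST_BUCKET, Key=rollup_key)["Body"].read()
        except s3.exceptions.NoSuchKey:
            existing = b""

        s3.put_object(Bucket=DEST_BUCKET, Key=rollup_key, Body=existing + body)
```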
I can't comment on the Redshift loading performance, but I suspect it's not a huge deal. If it were, or becomes one, I expect AWS will do something about it, since this is the usage pattern they set up.
Upvotes: 8