Reputation: 1043
When I create a Kinesis Data Firehose delivery stream, there are 2 options for the Source: Kinesis Data Stream, or Direct PUT or other sources.
What are the advantages and disadvantages of these options?
Upvotes: 6
Views: 7426
Reputation: 238847
They serve different purposes. But if your aim is only to ingest records for storing (and optionally transforming) them in S3, Redshift, or Elasticsearch, then the main difference is simplicity.
Direct PUT or other sources
Allows for direct "manual" injection of records into firehose. For the ingestion, you or your application have to use put-record or put-record-batch.
These API calls are simple and straightforward to use, in the sense that you don't need to manage record partitioning: you just provide them with the firehose name and the record(s) to be written. Nothing else is required.
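As a rough illustration of how little put-record-batch needs, here is a minimal Python sketch. It assumes boto3's Firehose client (`put_record_batch` with `DeliveryStreamName` and `Records`) and the documented limit of 500 records per call; the helper names and the stream name `my-delivery-stream` are hypothetical, and retrying failed records is left out.

```python
import json
from typing import Any, Iterable, Iterator

MAX_BATCH = 500  # PutRecordBatch accepts at most 500 records per call


def to_firehose_records(events: Iterable[dict]) -> list:
    """Encode each event as newline-terminated JSON bytes."""
    return [{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events]


def batches(records: list, size: int = MAX_BATCH) -> Iterator[list]:
    """Yield consecutive slices of at most `size` records."""
    for i in range(0, len(records), size):
        yield records[i:i + size]


def send_to_firehose(client: Any, stream_name: str, events: Iterable[dict]) -> int:
    """Send events in batches; return how many records Firehose reported as failed.

    `client` is expected to behave like boto3.client("firehose").
    Note that there is no partition key anywhere: only a name and the data.
    """
    failed = 0
    for batch in batches(to_firehose_records(events)):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name, Records=batch
        )
        failed += response.get("FailedPutCount", 0)
    return failed


# Usage (requires AWS credentials and an existing delivery stream):
#   import boto3
#   send_to_firehose(boto3.client("firehose"), "my-delivery-stream", [{"id": 1}])
```

A production version would also inspect `RequestResponses` and resend the individual records that failed.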
Also, firehose is basically serverless, so you do not need to manage its scaling or provision its throughput. It's all done automatically for you.
However, firehose is not completely "real-time". Because it buffers records (by size and by time interval) before delivering them, your records always reach the destination with some delay.
Kinesis Data Stream
If you front your firehose with a kinesis stream, then you have to inject records into the stream. For that you use put-record or put-records. If you look at these API calls, they are more complicated, as you have to manage partition keys yourself. You have to do it correctly, as otherwise you end up with hot/cold shards and have to worry about how to fix that.
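To make the hot/cold shard problem concrete, here is a small Python sketch. Kinesis maps each partition key to a shard by taking the MD5 hash of the key as a 128-bit integer; the sketch assumes, for illustration only, that the shards split that hash space into equal ranges, which is not guaranteed after resharding. The helper names and the `user_id` field are hypothetical.

```python
import hashlib
import json
from collections import Counter


def to_kinesis_entries(events, key_field):
    """Build PutRecords entries, using one event field as the partition key."""
    return [
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e[key_field])}
        for e in events
    ]


def shard_index(partition_key: str, shard_count: int) -> int:
    """Approximate the shard a key lands on, assuming equal hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # 0 <= hash_value < 2**128
    return hash_value * shard_count // 2**128


def shard_load(entries, shard_count: int) -> Counter:
    """Count how many entries would go to each shard."""
    return Counter(shard_index(e["PartitionKey"], shard_count) for e in entries)


# A high-cardinality key spreads records across shards ...
spread = shard_load(
    to_kinesis_entries([{"user_id": i} for i in range(1000)], "user_id"), 4
)
# ... while a constant key sends everything to a single (hot) shard.
hot = shard_load(
    to_kinesis_entries([{"user_id": "same"} for _ in range(1000)], "user_id"), 4
)
```

The entries have the shape that boto3's `put_records(StreamName=..., Records=entries)` expects, so the same list could be sent to a real stream.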
Also, data streams are not serverless, in the sense that they do not autoscale (at least in provisioned capacity mode). You have to manage their throughput yourself. This means that you have to calculate and provision the number of shards you require. If you do it incorrectly, you will have issues.
Conclusions
Choose direct put to firehose if you only aim at storing (and optionally transforming) your records in supported storage destinations.
Choose to use a kinesis data stream in front of firehose if you require not only storing, but also doing other things with your records in real-time. This is because a stream can have other consumers besides firehose, ones which do require real-time data.
Upvotes: 10
Reputation: 4526
The primary difference (assuming that you're not attaching anything else to the stream) is that you need a support ticket to scale Firehose.
Per the docs:
When Direct PUT is configured as the data source, each Kinesis Data Firehose delivery stream provides the following combined quota for PutRecord and PutRecordBatch requests:
For US East (N. Virginia), US West (Oregon), and Europe (Ireland): 5,000 records/second, 2,000 requests/second, and 5 MiB/second.
For US East (Ohio), US West (N. California), AWS GovCloud (US-East), AWS GovCloud (US-West), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (London), Europe (Paris), Europe (Stockholm), Middle East (Bahrain), and South America (São Paulo): 1,000 records/second, 1,000 requests/second, and 1 MiB/second.
To request an increase in quota, use the Amazon Kinesis Data Firehose Limits form.
Kinesis streams, by comparison, let you control scaling based on the number of shards: 1,000 records per second or 1 MB/second ingest per shard. If you discover that you need more capacity, you can easily increase the number of shards.
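As a back-of-the-envelope illustration of sizing by shards, the per-shard ingest limits quoted above (1,000 records/second or 1 MB/second) can be turned into a small calculation. This is only a sketch: the function name is hypothetical, and it ignores headroom for traffic spikes and uneven partition keys.

```python
import math

RECORDS_PER_SHARD = 1_000  # ingest limit per shard, records/second
MB_PER_SHARD = 1.0         # ingest limit per shard, MB/second


def required_shards(records_per_second: float, mb_per_second: float) -> int:
    """Smallest shard count that satisfies both per-shard ingest limits."""
    by_records = math.ceil(records_per_second / RECORDS_PER_SHARD)
    by_bytes = math.ceil(mb_per_second / MB_PER_SHARD)
    return max(by_records, by_bytes, 1)


print(required_shards(3_500, 2.0))  # 4: the record rate is the bottleneck
print(required_shards(200, 6.5))    # 7: the byte rate is the bottleneck
```

Whichever limit is hit first determines the shard count, so small, frequent records and large, infrequent records can need very different provisioning for the same data volume.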
Another difference is that Firehose only retains records for 24 hours if the destination is unavailable, while a Kinesis stream can be configured to retain records for up to a week.
For a robust architecture, I recommend using a combination of Kinesis streams for ingest, and Firehose for batching and writing to the destination.
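For reference, the two Source options from the question correspond to the `DeliveryStreamType` values `DirectPut` and `KinesisStreamAsSource` in the Firehose CreateDeliveryStream API. The Python sketch below only builds the request parameters for an S3 destination; the function name, the buffering values, and all ARNs are placeholders chosen for illustration, and a real setup would also need an IAM role with the right permissions.

```python
from typing import Optional


def delivery_stream_params(
    name: str,
    bucket_arn: str,
    role_arn: str,
    source_stream_arn: Optional[str] = None,
) -> dict:
    """Build keyword arguments for boto3's firehose create_delivery_stream.

    Without source_stream_arn the stream uses Direct PUT; with it, the
    stream reads from the given Kinesis data stream instead.
    """
    params = {
        "DeliveryStreamName": name,
        "DeliveryStreamType": "DirectPut",
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            # Illustrative values: flush after 5 MB or 60 seconds,
            # whichever comes first. This buffering is the delivery delay.
            "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        },
    }
    if source_stream_arn is not None:
        params["DeliveryStreamType"] = "KinesisStreamAsSource"
        params["KinesisStreamSourceConfiguration"] = {
            "KinesisStreamARN": source_stream_arn,
            "RoleARN": role_arn,
        }
    return params


# Usage (requires AWS credentials; all ARNs below are placeholders):
#   import boto3
#   boto3.client("firehose").create_delivery_stream(
#       **delivery_stream_params(
#           "my-delivery-stream",
#           "arn:aws:s3:::my-bucket",
#           "arn:aws:iam::123456789012:role/my-firehose-role",
#           "arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
#       )
#   )
```

With a stream as the source, producers write to the Kinesis stream and Firehose pulls from it, so direct put-record calls against the delivery stream are no longer used.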
Upvotes: 2