Takayuki Sato

Reputation: 1043

Kinesis Data Firehose source `Direct PUT` vs `Kinesis Data Stream`

When I create a Kinesis Data Firehose delivery stream, there are two options for the Source: `Direct PUT` and `Kinesis Data Stream`.

What are the advantages and disadvantages of each option?

Upvotes: 6

Views: 7426

Answers (2)

Marcin

Reputation: 238847

They serve different purposes. But if your aim is only to ingest records and store them (optionally transforming them first) in S3, Redshift or Elasticsearch, then the main difference is simplicity.

Direct PUT or other sources

Allows for direct "manual" ingestion of records into Firehose. To write records, you or your application have to use put-record or put-record-batch.

These API calls are simple and straightforward to use, in the sense that you don't need to manage record partitioning: you just provide the Firehose name and the record(s) to be written. Nothing else is required.
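To illustrate how little the Direct PUT path asks of the caller, here is a minimal sketch in Python. It assumes a boto3-style Firehose client is passed in (for example `boto3.client("firehose")`); the helper names `chunked` and `send_to_firehose` are made up for this example. The only bookkeeping is splitting the input into groups of at most 500 records, which is the per-call limit of PutRecordBatch.

```python
from typing import Any, Iterator, List

MAX_BATCH_RECORDS = 500  # PutRecordBatch accepts at most 500 records per call


def chunked(items: List[bytes], size: int) -> Iterator[List[bytes]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send_to_firehose(client: Any, stream_name: str, payloads: List[bytes]) -> int:
    """Send payloads with PutRecordBatch; return how many records were rejected.

    `client` is assumed to expose put_record_batch() like the boto3
    Firehose client does. No partition key is needed anywhere.
    """
    failed = 0
    for batch in chunked(payloads, MAX_BATCH_RECORDS):
        response = client.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=[{"Data": data} for data in batch],
        )
        # A call that succeeds overall can still report per-record failures.
        failed += response.get("FailedPutCount", 0)
    return failed
```

This sketch does not enforce the 4 MiB per-request size limit and does not retry rejected records; a real producer would inspect `RequestResponses` in the response and resend the entries that carry an error code.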

Also, Firehose is basically serverless, so you do not need to manage its scaling or provision its throughput. It's all done automatically for you.

However, Firehose is not completely "real-time". Because it buffers records (by size and by time interval) before delivering them, your records always arrive with some delay.

Kinesis Data Stream

If you front your Firehose with a Kinesis stream, then you have to write records to the stream instead. For that you use put-record and/or put-records. If you look at these API calls, they are more complicated because you have to manage partition keys yourself. You have to do it correctly, as otherwise you end up with hot/cold shards and have to work out how to fix them.
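To make the hot/cold shard problem concrete: Kinesis Data Streams maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that integer. The sketch below simulates this locally in Python. It assumes the shards split the hash space into equal-width ranges (the usual state of a freshly created stream), and the function names and sample keys are made up for illustration.

```python
import hashlib
from collections import Counter
from typing import Iterable

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Index of the shard a key lands on, assuming equal-width hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    return hashed * shard_count // HASH_SPACE


def shard_load(partition_keys: Iterable[str], shard_count: int) -> Counter:
    """Count how many records each shard would receive."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


# One constant key sends every record to a single (hot) shard.
hot = shard_load(["tenant-a"] * 1000, shard_count=4)

# A high-cardinality key spreads the same volume over all shards.
spread = shard_load((f"device-{i}" for i in range(1000)), shard_count=4)
```

In the real API the key travels with each record, e.g. `{"Data": ..., "PartitionKey": ...}` in a put-records call, so choosing a key with many distinct values is the main lever you have for keeping shards evenly loaded.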

Also, data streams are not serverless in the sense that they do not autoscale. You have to manage their throughput yourself. This means that you have to calculate and provision the number of shards you require. If you get it wrong, you will run into throttling or pay for capacity you don't use.

Conclusions

Choose Direct PUT to Firehose if you only aim to store (and optionally transform) your records in supported storage destinations.

Choose a Kinesis data stream in front of Firehose if you need not only to store your records, but also to do other things with them in real time. This is because the stream can have consumers other than Firehose that do require real-time data.

Upvotes: 10

Parsifal

Reputation: 4526

The primary difference (assuming that you're not attaching anything else to the stream) is that you need a support ticket to scale Firehose.

Per the docs:

When Direct PUT is configured as the data source, each Kinesis Data Firehose delivery stream provides the following combined quota for PutRecord and PutRecordBatch requests:

  • For US East (N. Virginia), US West (Oregon), and Europe (Ireland): 5,000 records/second, 2,000 requests/second, and 5 MiB/second.

  • For US East (Ohio), US West (N. California), AWS GovCloud (US-East), AWS GovCloud (US-West), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (London), Europe (Paris), Europe (Stockholm), Middle East (Bahrain), and South America (São Paulo): 1,000 records/second, 1,000 requests/second, and 1 MiB/second.

To request an increase in quota, use the Amazon Kinesis Data Firehose Limits form.

Kinesis streams, by comparison, let you control scaling based on the number of shards: 1,000 records per second or 1 MB/second ingest per shard. If you discover that you need more capacity, you can easily increase the number of shards.
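As a rough worked example of that sizing, the shard count is whichever of the two per-shard limits you hit first. The sketch below uses the figures quoted above (1,000 records/second and 1 MB/second per shard) and assumes 1 MB means 10^6 bytes, which is the conservative reading; the function name is made up for this example.

```python
import math

RECORDS_PER_SHARD = 1_000    # records/second per shard, as quoted above
BYTES_PER_SHARD = 1_000_000  # 1 MB/second per shard (assumed to mean 10^6 bytes)


def shards_needed(records_per_second: float, bytes_per_second: float) -> int:
    """Smallest shard count that satisfies both ingest limits (at least 1)."""
    by_count = math.ceil(records_per_second / RECORDS_PER_SHARD)
    by_bytes = math.ceil(bytes_per_second / BYTES_PER_SHARD)
    return max(1, by_count, by_bytes)


# Many small records: the record-count limit dominates.
small_records = shards_needed(records_per_second=2_500, bytes_per_second=500_000)

# Fewer but larger records: the byte-rate limit dominates.
large_records = shards_needed(records_per_second=800, bytes_per_second=4_200_000)
```

Here `small_records` comes out as 3 and `large_records` as 5. This only covers steady-state ingest; you would normally add headroom for traffic spikes and uneven partition keys.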

Another difference is that Firehose only retains records for 24 hours if the destination is unavailable, while a Kinesis stream can be configured to retain records for up to a week.

For a robust architecture, I recommend using a combination of Kinesis streams for ingest, and Firehose for batching and writing to the destination.

Upvotes: 2
