Glide

Reputation: 21245

What benefits are there to using Kafka Connect to write to S3 from Kafka?

I'm just curious: is it not straightforward to write my own code that reads from Kafka using the Kafka Consumer API and uses the AWS SDK to write to S3? Are there a lot of non-obvious complications to deal with?

I'm asking since Kafka Connect seems to be the most commonly suggested way to write to S3 from Kafka.

Upvotes: 4

Views: 2293

Answers (2)

Konstantine Karantasis

Reputation: 1993

You might have seen this analogy before, so I'll use it here too: you may think of Connect as a high-level framework built on Kafka producers and consumers that aims to integrate your data with Kafka using Sources and Sinks (the high-level equivalents of producers and consumers, respectively, in Connect). A variety of such Sources and Sinks, briefly called Connectors, is already available.

Specifically, with respect to data export from Kafka to Amazon S3, there are a few connectors already available, and since I'm in part responsible for the latest one, allow me to mention a few advantages of using it. (Hopefully this will answer your question about whether it is straightforward to implement all these features from scratch.)

Compared to writing a program directly on top of consumers, I will group my arguments roughly into two categories:

Pros offered by the Connect Framework

  • Transparent and scalable execution on a cluster.
  • Fault-tolerant execution, same as with groups of Kafka consumers (the advantage being that you get fault tolerance without having to write the code yourself).
  • A REST interface to start and stop connectors.
  • A small set of metrics (which will be expanded soon to a full set of performance and operational metrics).
  • Overall, define simple and intuitive streaming data flows that include Sources, simple transformations on your data (SMTs) and Sinks.
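On the REST interface point: managing a connector's lifecycle is a matter of a few HTTP calls against a Connect worker. The commands below are a sketch assuming Connect's default REST port (8083) and a hypothetical connector named `my-s3-sink`:

```shell
# List the connectors currently deployed on the cluster
curl http://localhost:8083/connectors

# Pause and resume a running connector without redeploying anything
curl -X PUT http://localhost:8083/connectors/my-s3-sink/pause
curl -X PUT http://localhost:8083/connectors/my-s3-sink/resume

# Remove the connector when it is no longer needed
curl -X DELETE http://localhost:8083/connectors/my-s3-sink
```

With a plain consumer application, each of these operations is something you would have to build and operate yourself.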

Pros offered by the S3 Connector

  • Multiple formatters (currently exporting binary .avro files and text .json files)
  • Support of structured or unstructured data, with modes for schema evolution for the former.
  • A gamut of partitioners: size-, time-, or field-based. If they don't do exactly what you want out of the box, you can use them as base classes to build custom partitioners that fit your use case.
  • Exactly-once semantics for most of the use cases of the partitioners above (meaning that, if you reprocess your data, or you recover from a failure, you won't see duplicate records in S3).
  • Easily configurable.
  • Active support from the community (which your classes might also end up having if you open-source them).
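To give an idea of the configuration effort involved, here is a sketch of a sink configuration for the S3 connector (property names follow the connector's documented settings; the connector, topic, and bucket names are placeholders for illustration):

```
name=my-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=2
topics=my-topic

# Where and how the records are stored
s3.bucket.name=my-bucket
s3.region=us-west-2
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.avro.AvroFormat

# Partitioning and commit policy
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
flush.size=1000

# Schema evolution mode for structured data
schema.compatibility=NONE
```

No consumer code, offset management, or retry logic has to be written; the framework and the connector handle all of that.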

Overall, you won't have to write from scratch and maintain code that many others (like you) want to use. Furthermore, if you find that one or more features are missing, you can contribute these features in the open source S3 Connector.

Upvotes: 3

Matthias J. Sax
Matthias J. Sax

Reputation: 62330

There are a couple of advantages:

  • Connect can be deployed in a distributed fashion and thus scales
  • Connect is fault-tolerant
  • You just configure the connector and use it (no coding required)
  • If you update, you don't need to update any code (you did not write any)

Of course, you can write your own consumer application that writes to S3, but why reinvent the wheel?

Upvotes: 1
