Viren

Reputation: 180

What are the options to load HDFS using Kafka?

What are the current options/solutions for loading data into HDFS from Apache Kafka?

I am looking for options on the consumer end of Kafka here, and for something that scales to at least a few terabytes per day.

I also have a few basic requirements:

a) Output to HDFS should be partitioned.

b) Records on Kafka may not be strictly chronological, but the output should be (as much as possible).

c) Reliable in case of system outages (network partitions, software/hardware crashes, etc.).

I looked through StackOverflow, but many of the Q&As are dated. Hence this question.

Upvotes: 1

Views: 740

Answers (2)

OneCricketeer

Reputation: 191963

Before the Confluent HDFS Connector, there was a product called Camus, which you can still find under LinkedIn's GitHub. That project has since been moved into the Apache Gobblin project.

As for the dated posts you may have found: Apache Flume and Storm still exist, and they seem to be the only built-in streaming options for Cloudera environments.

Hortonworks offers Apache NiFi.

StreamSets offers a Cloudera Parcel.

Flink and Spark work, but require some level of knowledge to reliably scale, maintain, and debug those custom processes (as compared to simple config files in Connect, Camus/Gobblin, Flume).

Depending on what is available in your environment: while I personally don't have much experience with Fluentd or Logstash, I know they both have Kafka and HDFS configuration options.


From what I've worked with, Connect and Camus offer the most flexible partitioning options (even if you need to write a custom partitioner yourself, the Partitioner interface is very simple). Flume is probably similar, though I've not used it.
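For illustration, a time-partitioned HDFS Sink Connector configuration might look roughly like this. The host names, topic, and flush size are placeholders, and the exact class and property names can differ between connector versions, so treat it as a sketch rather than a verified config:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=4
    topics=events
    hdfs.url=hdfs://namenode:8020
    # Number of records to accumulate before writing a file to HDFS
    flush.size=10000
    # Time-based partitioning: one output directory per hour under the topic's path
    partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
    partition.duration.ms=3600000
    path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
    locale=en-US
    timezone=UTC
    # Partition by the record timestamp rather than wall-clock time
    timestamp.extractor=Record

Swapping partitioner.class for your own class is how a custom partitioner gets plugged in.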

NiFi and StreamSets don't require deploying any JAR files, which has its benefits.

Storm/Spark/Flink jobs would all, of course, need to be written in such a way that the output partitions are created.


Reliability and delivery guarantees should be handled partially on the broker and consumer sides via offset management and topic retention. In general, most consumer processes will give you "at least once" consumption.
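To make the "at least once" point concrete, here is a minimal sketch of a consumer loop that commits offsets only after the records have been written out. The broker address, group id, topic name, and writeToHdfs helper are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class AtLeastOnceConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("group.id", "hdfs-loader");              // placeholder group id
            props.put("enable.auto.commit", "false");          // commit only after a successful write
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        writeToHdfs(record); // flush to HDFS before committing offsets
                    }
                    // Offsets are committed only after the batch is durably written, so a crash
                    // between the write and the commit leads to re-delivery (at least once).
                    consumer.commitSync();
                }
            }
        }

        private static void writeToHdfs(ConsumerRecord<String, String> record) {
            // placeholder for the actual HDFS write
        }
    }

If the process crashes between the write and the commit, the same records are re-delivered on restart, so the HDFS writer needs to tolerate duplicates.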

Upvotes: 3

Giorgos Myrianthous

Reputation: 39930

To move data between Kafka and Hadoop HDFS, you can use the HDFS connector for Kafka Connect. The connector's documentation can be found here.

Regarding your requirements:

a) To configure partitioning, have a look at the partitioner configuration section of the documentation.

b) There are some ordering guarantees in Kafka. It only provides a total order over messages within a partition, not between different partitions of a topic. If, for example, you need to make sure that the messages related to a particular user are ordered, then you can assign a key (e.g. user_id) to the messages so that all messages with the same key go to the same partition and therefore their order is guaranteed (see the sketch after this list).

c) High availability is provided by Kafka out of the box (assuming that you have set up the required brokers and resources correctly). For a more complete answer regarding high availability and data loss, see my answer to this question on SO.
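To illustrate point b), here is a minimal sketch of a producer that keys messages by user_id so that all of a user's records land in the same partition and keep their relative order. The broker address, topic, and key are placeholders:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class KeyedProducerExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                String userId = "user-42"; // placeholder key
                // Records with the same key hash to the same partition,
                // so per-user ordering is preserved within that partition.
                producer.send(new ProducerRecord<>("events", userId, "login"));
                producer.send(new ProducerRecord<>("events", userId, "purchase"));
            }
        }
    }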

Upvotes: 2
