johny

Reputation: 61

Does Apache Kafka store the messages internally in HDFS or some other file system?

We have a project requirement of testing the data at the Kafka layer. JSON files are moving into the Hadoop area and Kafka is reading the live data in Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and read by Kafka are the same.

Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive saves internally, i.e. a single folder for a single table?

Upvotes: 3

Views: 9911

Answers (3)

Matthias J. Sax

Reputation: 62350

Kafka stores data in local files (i.e., on the local file system of each running broker). For those files, Kafka uses its own storage format, based on a partitioned append-only log abstraction.

The local storage directory can be configured via the parameter log.dir. This configuration happens individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
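Inside that directory, Kafka creates one sub-directory per topic partition, holding the append-only segment files; this is the closest analogue to Hive's one-folder-per-table layout the question asks about. A rough sketch of the layout (topic name and offsets are illustrative):

    /tmp/kafka-logs/
        my-topic-0/                         # one directory per <topic>-<partition>
            00000000000000000000.log        # message data (log segment)
            00000000000000000000.index      # offset index
            00000000000000000000.timeindex  # timestamp index
        my-topic-1/
            ...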

The Kafka community is also working on tiered storage, which will allow brokers not only to use local disks, but also to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage

Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka-internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated and, if you write, you always need to write to the partition leader (if you create a KafkaProducer, it will automatically find the leader for each partition you write to).
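As a rough sketch of how to fetch that metadata with the modern Java AdminClient (the wiki page above uses the older SimpleConsumer API): the topic name and bootstrap address below are placeholders, and allTopicNames() assumes Kafka clients 3.1+ (older versions use all() instead):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class TopicMetadataExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address; point this at your own cluster.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Fetch partition, leader, and replica metadata for one topic.
                TopicDescription description = admin
                        .describeTopics(Collections.singletonList("my-topic"))
                        .allTopicNames().get()
                        .get("my-topic");
                for (TopicPartitionInfo partition : description.partitions()) {
                    System.out.printf("partition=%d leader=%s replicas=%s%n",
                            partition.partition(), partition.leader(), partition.replicas());
                }
            }
        }
    }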

For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index

Upvotes: 6

ketankk

Reputation: 2674

This happens to most beginners. Let's first understand that a component you see in Big Data processing may not be related to Hadoop at all.

YARN, MapReduce, and HDFS are the three core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.

Frameworks like Kafka and Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode, and HDFS for storage.

In the same way, Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system:

log.dirs=/tmp/kafka-logs

You can check this in $KAFKA_HOME/config/server.properties.

Hope this helps.

Upvotes: 0

Ahmed Abdelrahman

Reputation: 129

I think you can, but you have to do it manually. You can let Kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the following:

  • Assuming all servers are running (check the Confluent website)
  • Create your connector:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics='your topic'
    hdfs.url=hdfs://localhost:9000
    flush.size=3
    
  • Note: This approach assumes that you are using their platform (the Confluent Platform), which I haven't used.

  • Fire up the kafka-hdfs streamer.
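For example, with the standalone Kafka Connect worker that ships with Kafka, saving the properties above as hdfs-sink.properties, launching it would look roughly like this (the worker config path is an assumption; adjust it for your installation):

    bin/connect-standalone.sh config/connect-standalone.properties hdfs-sink.properties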

Also you might find more useful details in this Stack Overflow discussion.

Upvotes: 1
