johny

Reputation: 61

Does Apache Kafka store the messages internally in HDFS or some other file system?

We have a project requirement of testing the data at the Kafka layer. JSON files are moving into the Hadoop area and Kafka is reading the live data in Hadoop (raw JSON files). Now I have to test whether the data sent from the other system and read by Kafka are the same.

Can I validate the data at Kafka? Does Kafka store the messages internally on HDFS? If yes, is it stored in a file structure similar to what Hive saves internally, i.e. a single folder for a single table?

Upvotes: 3

Views: 9911

Answers (3)

Matthias J. Sax

Reputation: 62350

Kafka stores data in local files (i.e., on the local file system of each running broker). For those files, Kafka uses its own storage format, based on a partitioned append-only log abstraction.

The local storage directory can be configured via the parameter log.dir. This configuration happens individually for each broker, i.e., each broker can use a different location. The default value is /tmp/kafka-logs.
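Inside that directory, Kafka creates one sub-directory per topic partition, holding the append-only segment files; this is the closest analogue to Hive's one-folder-per-table layout the question asks about. A rough sketch of the layout (topic name and offsets are illustrative):

    /tmp/kafka-logs/
        my-topic-0/                         # one directory per <topic>-<partition>
            00000000000000000000.log        # message data (log segment)
            00000000000000000000.index      # offset index
            00000000000000000000.timeindex  # timestamp index
        my-topic-1/
            ...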

The Kafka community is also working on tiered storage, which will allow brokers not only to use local disks, but also to offload "cold data" into a second tier: https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage

Furthermore, each topic has multiple partitions. How partitions are distributed is a Kafka-internal implementation detail, so you should not rely on it. To get the current state of your cluster, you can request metadata about topics, partitions, etc. (see https://cwiki.apache.org/confluence/display/KAFKA/Finding+Topic+and+Partition+Leader for a code example). Also keep in mind that partitions are replicated and, if you write, you always need to write to the partition leader (if you create a KafkaProducer, it will automatically find the leader for each partition you write to).
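As a rough sketch of how to fetch that metadata with the modern Java AdminClient (the wiki page above uses the older SimpleConsumer API): the topic name and bootstrap address below are placeholders, and allTopicNames() assumes Kafka clients 3.1+ (older versions use all() instead):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class TopicMetadataExample {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder address; point this at your own cluster.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // Fetch partition, leader, and replica metadata for one topic.
                TopicDescription description = admin
                        .describeTopics(Collections.singletonList("my-topic"))
                        .allTopicNames().get()
                        .get("my-topic");
                for (TopicPartitionInfo partition : description.partitions()) {
                    System.out.printf("partition=%d leader=%s replicas=%s%n",
                            partition.partition(), partition.leader(), partition.replicas());
                }
            }
        }
    }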

For further information, browse https://cwiki.apache.org/confluence/display/KAFKA/Index

Upvotes: 6

ketankk

Reputation: 2674

This happens to most beginners. Let's first understand that a component you see in Big Data processing may not be related to Hadoop at all.

YARN, MapReduce, and HDFS are the three core components of Hadoop. Hive, Pig, Oozie, Sqoop, HBase, etc. work on top of Hadoop.

Frameworks like Kafka and Spark are not dependent on Hadoop; they are independent entities. Spark supports Hadoop: YARN can be used for Spark's cluster mode, and HDFS for storage.

In the same way, Kafka, as an independent entity, can work with Spark. It stores its messages in the local file system:

log.dirs=/tmp/kafka-logs

You can check this in $KAFKA_HOME/config/server.properties.

Hope this helps.

Upvotes: 0

Ahmed Abdelrahman

Reputation: 129

I think you can, but you have to do it manually. You can let Kafka sink whatever output to HDFS. Maybe my answer is a bit late and this 'confluent' reference appeared after that, but briefly one can do the following:

  • Assuming all servers are running (check the Confluent website)
  • Create your connector:

    name=hdfs-sink
    connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
    tasks.max=1
    topics='your topic'
    hdfs.url=hdfs://localhost:9000
    flush.size=3
    
  • Note: This approach assumes that you are using their platform (the Confluent Platform), which I haven't used.

  • Fire up the kafka-hdfs streamer.
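For example, with the standalone Kafka Connect worker that ships with Kafka, saving the properties above as hdfs-sink.properties, launching it would look roughly like this (the worker config path is an assumption; adjust it for your installation):

    bin/connect-standalone.sh config/connect-standalone.properties hdfs-sink.properties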

Also you might find more useful details in this Stack Overflow discussion.

Upvotes: 1
