SunLynx

Reputation: 57

What is use of kafka in Big Data cluster?

I have recently deployed Big Data cluster. In that I've used Apache Kafka and zookeeper. But still I didn't understand about its usage in cluster. When both are required and for what purpose?

Upvotes: 0

Views: 601

Answers (3)

Ravindra babu

Reputation: 38950

I am simplifying the concepts here. You can find a detailed explanation in this article.

Kafka is a fast, scalable, partitioned, and replicated commit log service that is distributed by design. It has a unique design.

A stream of Messages of a particular type is defined as a Topic.

A Producer can be anyone who can publish messages to a Topic.

The published messages are then stored at a set of servers called Brokers or Kafka Cluster.

A Consumer can subscribe to one or more Topics and consume the published Messages by pulling data from the Brokers.
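The Topic/Producer/Broker/Consumer relationship above can be sketched as a toy in-memory model. This is purely illustrative (the class and method names are mine, not the Kafka client API), and it ignores partitioning and replication, but it shows the pull-based flow: producers publish to a topic, the broker appends to a log, and consumers pull from it.

```python
# Toy in-memory model of the concepts above; not the real Kafka API.

class Broker:
    """Stores messages per topic as an append-only log."""
    def __init__(self):
        self.topics = {}  # topic name -> list of messages

    def publish(self, topic, message):
        self.topics.setdefault(topic, []).append(message)

    def fetch(self, topic, offset):
        # Consumers pull messages starting at their own offset.
        return self.topics.get(topic, [])[offset:]

class Producer:
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        self.broker.publish(topic, message)

class Consumer:
    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0  # each consumer tracks its own position

    def poll(self):
        messages = self.broker.fetch(self.topic, self.offset)
        self.offset += len(messages)
        return messages

broker = Broker()
producer = Producer(broker)
consumer = Consumer(broker, "page-views")

producer.send("page-views", "user1 viewed /home")
producer.send("page-views", "user2 viewed /about")
print(consumer.poll())  # both messages, pulled from the broker
```

In a real cluster the topic's log is split into partitions spread across several brokers, but the pull model is the same.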

ZooKeeper is a distributed coordination service that exposes a hierarchical, file-system-like namespace and facilitates loose coupling between clients.

ZooKeeper achieves high availability by running multiple ZooKeeper servers, called an ensemble.

ZooKeeper is used for managing and coordinating the Kafka brokers.

Each Kafka broker coordinates with the other brokers through ZooKeeper.

Producers and consumers are notified by the ZooKeeper service when a new broker joins the Kafka system or an existing broker fails.
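The notification pattern described above can be sketched as a tiny watcher model. This is an assumption-laden toy (the `Coordinator` class and its method names are mine): real ZooKeeper does this with ephemeral znodes and watch events, where a broker's node disappearing when its session expires is what signals failure.

```python
# Toy sketch of ZooKeeper-style notifications; not the ZooKeeper API.

class Coordinator:
    def __init__(self):
        self.brokers = set()
        self.watchers = []  # callbacks registered by producers/consumers

    def watch(self, callback):
        self.watchers.append(callback)

    def _notify(self, event, broker_id):
        for callback in self.watchers:
            callback(event, broker_id)

    def register_broker(self, broker_id):
        self.brokers.add(broker_id)
        self._notify("broker-up", broker_id)

    def broker_failed(self, broker_id):
        # In real ZooKeeper, the broker's ephemeral node vanishing
        # (session expiry) is what triggers the watch event.
        self.brokers.discard(broker_id)
        self._notify("broker-down", broker_id)

events = []
coordinator = Coordinator()
coordinator.watch(lambda event, broker_id: events.append((event, broker_id)))
coordinator.register_broker("broker-1")
coordinator.broker_failed("broker-1")
print(events)  # [('broker-up', 'broker-1'), ('broker-down', 'broker-1')]
```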


Upvotes: 1

Morgan Kenyon

Reputation: 3172

In regards to Kafka, I would add a couple things.

Kafka describes itself as a log, not a queue. A log is an append-only, totally-ordered sequence of records ordered by time.


In a strict data-structures sense, a queue is a FIFO collection designed to hold data; once an item is taken out of the queue, there is no way to get it back. Jaco does describe it as being a persistent queue, but using different terms (queue vs. log) can help in understanding.
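The queue-versus-log distinction above can be shown in a few lines: popping a queue destroys the record, while a log keeps every record and readers merely advance (or rewind) a cursor.

```python
from collections import deque

# Queue: taking an item out removes it for good.
queue = deque(["a", "b", "c"])
first = queue.popleft()           # "a" is gone from the queue
print(first, list(queue))         # a ['b', 'c']

# Log: append-only; reading just moves a cursor, and
# rewinding the cursor replays old records.
log = ["a", "b", "c"]
cursor = 0
read1 = log[cursor]; cursor += 1  # read advances the cursor...
cursor = 0                        # ...but we can rewind it
read_again = log[cursor]
print(read1, read_again, log)     # a a ['a', 'b', 'c']
```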

Kafka's log is saved to disk instead of being kept in memory. The designers of Kafka chose this because (1) they wanted to avoid much of the JVM overhead incurred when storing things in in-memory data structures, and (2) they wanted messages to persist even if the Java process dies for some reason.

Kafka is designed for multiple consumers (a Kafka term) to read from the same logs. Each consumer tracks its own offset in the log: Consumer A might be at offset 2, Consumer B at offset 8, and so on. Tracking consumers by offset eliminates a lot of complexity on Kafka's side.
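Per-consumer offsets over a shared log can be sketched like this (the `poll` helper and its names are illustrative, not the Kafka consumer API): both consumers read the same records, each at its own position.

```python
# Shared log, independent offsets per consumer (illustrative only).
log = [f"record-{i}" for i in range(10)]
offsets = {"consumer-a": 2, "consumer-b": 8}

def poll(consumer, max_records=3):
    """Return the next records for this consumer and advance its offset."""
    start = offsets[consumer]
    batch = log[start:start + max_records]
    offsets[consumer] = start + len(batch)
    return batch

print(poll("consumer-a"))  # ['record-2', 'record-3', 'record-4']
print(poll("consumer-b"))  # ['record-8', 'record-9']
```

The broker never has to track who has "removed" what; it just serves reads at whatever offset each consumer asks for.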

Reading that first link will explain a lot of the differences between Kafka and other messaging services.

Upvotes: 0

Alex

Reputation: 21766

Kafka is a distributed messaging system optimised for high throughput. It has a persistent queue, with messages appended to files using on-disk structures, and it performs consistently even with very modest hardware. In short, you will use Kafka to load data into your big data cluster, and you will be able to do this at high speed even on modest hardware because of Kafka's distributed nature.

Regarding ZooKeeper, it's a centralized, distributed configuration service and naming registry for large distributed systems. It is robust, since the persisted data is distributed between multiple nodes: a client connects to any one of them and migrates to another if its node fails, as long as a strict majority of nodes are working. So, in short, ZooKeeper helps make sure your big data cluster remains online even if some of its nodes are offline.
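The "strict majority" rule above is simple arithmetic: an ensemble of n servers stays available only while more than n // 2 of them are up, which is why ensembles are usually sized with an odd number of nodes.

```python
# Quorum arithmetic for a ZooKeeper ensemble of n servers.

def quorum(ensemble_size):
    """Minimum number of servers that must be up: a strict majority."""
    return ensemble_size // 2 + 1

def tolerated_failures(ensemble_size):
    """How many servers can fail while a quorum survives."""
    return ensemble_size - quorum(ensemble_size)

for n in (3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that a 4-node ensemble tolerates no more failures than a 3-node one, so the fourth node buys nothing in availability.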

Upvotes: 0
