Reputation: 1955
I would like to understand how the communication between the Kafka and Spark (Streaming) nodes takes place. I have the following question.
By communication I mean: is it RPC or socket communication? I would like to understand the internal anatomy.
Any help appreciated.
Thanks in advance.
Upvotes: 1
Views: 475
Reputation: 16910
First of all, it does not matter whether the Kafka nodes and the Spark nodes are in the same cluster or not, but they must be able to connect to each other (open the relevant ports in the firewall).
There are two ways to read from Kafka with Spark Streaming: the older KafkaUtils.createStream()
API, and the newer KafkaUtils.createDirectStream()
method.
I don't want to get into the differences between them; they are well documented here (in short, the direct stream is better).
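As a sketch of what the two call styles look like (using the spark-streaming-kafka 0.8 artifact; the host names, topic, and group id below are placeholders I made up, not values from the question):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

val conf = new SparkConf().setAppName("kafka-anatomy")
val ssc  = new StreamingContext(conf, Seconds(5))

// Receiver-based stream: goes through the high-level consumer and ZooKeeper.
val receiverStream = KafkaUtils.createStream(
  ssc, "zk-host:2181", "my-consumer-group", Map("my-topic" -> 1))

// Direct stream: Spark talks to the brokers itself,
// one RDD partition per Kafka partition.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker-host:9092"), Set("my-topic"))
```

This is only to show the shape of the two APIs; it needs the Spark and Kafka artifacts on the classpath and a running cluster to actually do anything.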
Addressing your question, how does the communication happen (internal anatomy): the best way to find out is looking at the Spark source code.
The createStream()
API uses a set of Kafka consumers, directly from the official org.apache.kafka
packages. These Kafka consumers have their own network layer called the NetworkClient
, which you can check here. In short, the NetworkClient
uses sockets for communicating.
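To illustrate what "uses sockets" means at the lowest level, here is a self-contained sketch in Scala (Spark's own language) of the same pattern the client's network layer follows: a plain java.nio SocketChannel writing a request and reading back a response. The toy echo server here stands in for a broker; it is obviously not speaking Kafka's wire protocol.

```scala
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{ServerSocketChannel, SocketChannel}

// Toy "broker": accepts one connection and echoes the bytes back.
val server = ServerSocketChannel.open()
server.bind(new InetSocketAddress("127.0.0.1", 0)) // 0 = pick a free port
val port = server.socket().getLocalPort

val echoThread = new Thread(() => {
  val peer = server.accept()            // blocking accept, like a broker's listener
  val buf = ByteBuffer.allocate(64)
  peer.read(buf)                        // read the "request"
  buf.flip()
  peer.write(buf)                       // echo it back as the "response"
  peer.close()
})
echoThread.start()

// "Consumer" side: a plain SocketChannel, the primitive under the client.
val client = SocketChannel.open(new InetSocketAddress("127.0.0.1", port))
client.write(ByteBuffer.wrap("fetch".getBytes("UTF-8")))

val reply = ByteBuffer.allocate(64)
while (client.read(reply) != -1) {}     // read until the server closes
reply.flip()
val received = new String(reply.array(), 0, reply.limit(), "UTF-8")

client.close()
echoThread.join()
server.close()
```

The real client adds request framing, a Selector for non-blocking multiplexing over many brokers, and the Kafka protocol on top, but the transport underneath is exactly this: a TCP socket.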
The createDirectStream()
API uses the Kafka SimpleConsumer
(from the kafka.consumer
package). The SimpleConsumer
class reads from Kafka through a java.nio.channels.ReadableByteChannel
, an interface that java.nio.channels.SocketChannel
implements, so in the end it is done with sockets as well, just a bit more indirectly, through Java's non-blocking I/O (NIO) APIs.
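That type relationship can be checked directly from Scala: SocketChannel is one of the JDK types that implements ReadableByteChannel (and WritableByteChannel), which is what lets a socket be treated as a plain byte channel.

```scala
import java.nio.channels.{ReadableByteChannel, SocketChannel, WritableByteChannel}

// A SocketChannel can be read from and written to as a byte channel.
val readable = classOf[ReadableByteChannel].isAssignableFrom(classOf[SocketChannel])
val writable = classOf[WritableByteChannel].isAssignableFrom(classOf[SocketChannel])
```

Both checks come back true: the socket is the byte channel.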
So to answer your question: it is done with sockets.
Upvotes: 3