Waqar Ahmed

Reputation: 5068

Why do we need Kafka to feed data to Apache Spark?

I am reading about Spark and its real-time stream processing. I am confused: if Spark can itself read a stream from a source such as Twitter or a file, why do we need Kafka to feed data to Spark? It would be great if someone could explain what advantage we get if we use Spark with Kafka. Thank you.

Upvotes: 13

Views: 4266

Answers (2)

ketankk

Reputation: 2674

Kafka decouples everything: consumers and producers need not know about each other. Kafka provides a pub-sub model based on topics.

Multiple sources can write data (messages) to any topic in Kafka, and a consumer (Spark or anything else) can consume data from that topic.

Multiple consumers can consume data from the same topic, since Kafka stores data for a period of time.

But in the end, it depends on your use case whether you really need a broker.
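The decoupling can be sketched with a tiny in-memory stand-in for a topic (hypothetical `Topic`/`Consumer` names, plain Python, no real Kafka client): producers only append to the topic, and each consumer tracks its own offset, so several consumers can independently read the same data.

```python
# Minimal in-memory sketch of Kafka's topic-based pub-sub model.
# Names (Topic, Consumer) are illustrative, not a real Kafka API.

class Topic:
    def __init__(self):
        self.log = []          # messages are appended, never overwritten

    def append(self, message):
        self.log.append(message)


class Consumer:
    """Each consumer tracks its own offset, independent of other consumers."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages


tweets = Topic()
for msg in ["tweet-1", "tweet-2", "tweet-3"]:
    tweets.append(msg)         # producers know nothing about consumers

spark_job = Consumer(tweets)   # two unrelated consumers...
audit_job = Consumer(tweets)   # ...both read the full topic independently

print(spark_job.poll())
print(audit_job.poll())
```

The point is only the shape of the interaction: the producer side never references the consumer side, which is what lets you swap either out without touching the other.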

Upvotes: 0

Sönke Liebau

Reputation: 1973

Kafka offers a decoupling and buffering of your input stream.

Take Twitter data for example: afaik you connect to the Twitter API and get a constant stream of tweets that match criteria you specified. If you now shut down your Spark jobs for an hour due to some maintenance on your servers, or to roll out a new version, then you will miss the tweets from that hour.

Now imagine you put Kafka in front of your Spark jobs and have a very simple ingest thread that does nothing but connect to the API and write tweets to Kafka, where the Spark jobs retrieve them from. Since Kafka persists everything to disk, you can shut down your processing jobs, perform maintenance, and when they are restarted they will retrieve all data from the time they were offline.
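A minimal sketch of that catch-up behaviour (plain Python with an in-memory list standing in for the persisted topic; `ingest`/`process_available` are hypothetical names):

```python
# The log stands in for a Kafka topic persisted to disk;
# committed_offset is the last position the processing job confirmed.
log = []
committed_offset = 0


def ingest(tweet):
    """The thin ingest thread keeps writing, whether or not the job is up."""
    log.append(tweet)


def process_available():
    """The processing job reads everything since its committed offset."""
    global committed_offset
    batch = log[committed_offset:]
    committed_offset = len(log)
    return batch


ingest("tweet-1")
process_available()        # job is up: handles tweet-1

# Job goes down for maintenance; ingest keeps running.
ingest("tweet-2")
ingest("tweet-3")

# Job restarts and resumes from its committed offset: nothing was lost.
recovered = process_available()
print(recovered)           # → ['tweet-2', 'tweet-3']
```

Without the log in the middle, the two tweets written during the downtime would simply be gone.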

Also, if you change your processing jobs in a significant way and want to reprocess data from the last week, you can easily do that if you have Kafka in your chain (provided you set your retention time high enough): you simply roll out your new jobs and reset the offsets in Kafka so that your jobs reread the old data. Once that is done, your data store is up to date with your new processing model.
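The replay idea can be sketched the same way (in-memory stand-in for the retained topic; with a real Kafka consumer you would seek to an earlier offset instead):

```python
# Retained history in the topic; the old job has already consumed it all.
log = ["tweet-1", "TWEET-2", "tweet-3"]
offset = len(log)


def new_processing(msg):
    """v2 of the job: normalizes case, so the old data must be redone."""
    return msg.lower()


# Roll out the new job and rewind the offset to replay the retained data.
offset = 0
store = [new_processing(m) for m in log[offset:]]
offset = len(log)

print(store)   # data store rebuilt under the new processing model
```

The retention window is the limit here: you can only replay as far back as Kafka still holds the data.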

There is a good article on the general principle written by Jay Kreps, one of the people behind Kafka; give that a read if you want to know more.

Upvotes: 14
