Waqar Ahmed

Reputation: 5068

Why do we need Kafka to feed data to Apache Spark?

I am reading about Spark and its real-time stream processing. I am confused: if Spark can itself read a stream from a source such as Twitter or a file, why do we need Kafka to feed data to Spark? It would be great if someone could explain what advantage we get if we use Spark with Kafka. Thank you.

Upvotes: 13

Views: 4266

Answers (2)

ketankk

Reputation: 2674

Kafka decouples everything: consumers and producers need not know about each other. Kafka provides a pub-sub model based on topics.

Multiple sources can write data (messages) to any topic in Kafka, and a consumer (Spark or anything else) can consume data from that topic.

Multiple consumers can consume data from the same topic, since Kafka stores data for a period of time.

But in the end, it depends on your use case whether you really need a broker.
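The decoupling can be sketched with a tiny in-memory stand-in for a topic (hypothetical `Topic`/`Consumer` names, plain Python, no real Kafka client): producers only append to the topic, and each consumer tracks its own offset, so several consumers can independently read the same data.

```python
# Minimal in-memory sketch of Kafka's topic-based pub-sub model.
# Names (Topic, Consumer) are illustrative, not a real Kafka API.

class Topic:
    def __init__(self):
        self.log = []          # messages are appended, never overwritten

    def append(self, message):
        self.log.append(message)


class Consumer:
    """Each consumer tracks its own offset, independent of other consumers."""
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages


tweets = Topic()
for msg in ["tweet-1", "tweet-2", "tweet-3"]:
    tweets.append(msg)         # producers know nothing about consumers

spark_job = Consumer(tweets)   # two unrelated consumers...
audit_job = Consumer(tweets)   # ...both read the full topic independently

print(spark_job.poll())
print(audit_job.poll())
```

The point is only the shape of the interaction: the producer side never references the consumer side, which is what lets you swap either out without touching the other.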

Upvotes: 0

Sönke Liebau

Reputation: 1973

Kafka offers a decoupling and buffering of your input stream.

Take Twitter data for example: afaik you connect to the Twitter API and get a constant stream of tweets that match criteria you specified. If you now shut down your Spark jobs for an hour due to some maintenance on your servers, or to roll out a new version, then you will miss the tweets from that hour.

Now imagine you put Kafka in front of your Spark jobs and have a very simple ingest thread that does nothing but connect to the API and write tweets to Kafka, where the Spark jobs retrieve them from. Since Kafka persists everything to disk, you can shut down your processing jobs, perform maintenance, and when they are restarted they will retrieve all data from the time they were offline.
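A minimal sketch of that catch-up behaviour (plain Python with an in-memory list standing in for the persisted topic; `ingest`/`process_available` are hypothetical names):

```python
# The log stands in for a Kafka topic persisted to disk;
# committed_offset is the last position the processing job confirmed.
log = []
committed_offset = 0


def ingest(tweet):
    """The thin ingest thread keeps writing, whether or not the job is up."""
    log.append(tweet)


def process_available():
    """The processing job reads everything since its committed offset."""
    global committed_offset
    batch = log[committed_offset:]
    committed_offset = len(log)
    return batch


ingest("tweet-1")
process_available()        # job is up: handles tweet-1

# Job goes down for maintenance; ingest keeps running.
ingest("tweet-2")
ingest("tweet-3")

# Job restarts and resumes from its committed offset: nothing was lost.
recovered = process_available()
print(recovered)           # → ['tweet-2', 'tweet-3']
```

Without the log in the middle, the two tweets written during the downtime would simply be gone.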

Also, if you change your processing jobs in a significant way and want to reprocess data from the last week, you can easily do that if you have Kafka in your chain (provided you set your retention time high enough): you simply roll out your new jobs and reset the offsets in Kafka so that your jobs reread the old data. Once that is done, your data store is up to date with your new processing model.
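The replay idea can be sketched the same way (in-memory stand-in for the retained topic; with a real Kafka consumer you would seek to an earlier offset instead):

```python
# Retained history in the topic; the old job has already consumed it all.
log = ["tweet-1", "TWEET-2", "tweet-3"]
offset = len(log)


def new_processing(msg):
    """v2 of the job: normalizes case, so the old data must be redone."""
    return msg.lower()


# Roll out the new job and rewind the offset to replay the retained data.
offset = 0
store = [new_processing(m) for m in log[offset:]]
offset = len(log)

print(store)   # data store rebuilt under the new processing model
```

The retention window is the limit here: you can only replay as far back as Kafka still holds the data.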

There is a good article on the general principle written by Jay Kreps, one of the people behind Kafka; give that a read if you want to know more.

Upvotes: 14
