Can I use Apache Kafka for the Batch Layer to store historical data in the Lambda Architecture?

Question

Kafka can act as a storage system for long-term data, and it replicates and distributes that data without problems. So can I create an RDD from all the historical data in Kafka, build a batch view from it, and then combine that with the Spark Streaming views?

Vidya · Accepted Answer

tl;dr Yes, but why?

According to Nathan Marz, formerly of Twitter and author of the Lambda Architecture, these are the storage requirements for the master dataset in the Batch Layer:

"Efficient appends of new data." It has to be easy to add to the master dataset.
"Scalable storage." The Batch Layer needs to hold all the data the architecture has ever seen "forever," which could reach petabytes depending on your situation.
"Support for parallel processing." The batch views that make it to the Serving Layer require applying functions to the master dataset, and these have to run in parallel to finish before the apocalypse is upon us.
"Enforceable immutability." It's critical to put checks in place to prevent mutations on the raw data, which is the source of truth for everything you do.
"Tunable storage and processing costs." The batch layer needs to give you the flexibility to decide how to store and compress your data at rest and in computations.
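
As a rough illustration of how these requirements might map onto Kafka topic settings, here is a minimal sketch that builds a `kafka-topics.sh --create` command. The topic name `events`, the broker address, and the partition and replica counts are made-up values, and `create_topic_command` is a hypothetical helper. The `--config` keys are standard topic-level settings, but the values chosen are assumptions, not a recommendation.

```python
# Assumed topic-level settings for a topic that plays the role of a master dataset.
MASTER_DATASET_TOPIC_CONFIG = {
    "retention.ms": "-1",        # never expire records by age
    "retention.bytes": "-1",     # never expire records by size
    "cleanup.policy": "delete",  # not "compact", so older records per key are kept
    "compression.type": "zstd",  # one knob for storage cost
}


def create_topic_command(topic, bootstrap_server, partitions, replication_factor, config):
    """Return the kafka-topics.sh invocation as a list of arguments."""
    cmd = [
        "kafka-topics.sh", "--create",
        "--topic", topic,
        "--bootstrap-server", bootstrap_server,
        "--partitions", str(partitions),            # partitions drive parallelism
        "--replication-factor", str(replication_factor),
    ]
    for key, value in sorted(config.items()):
        cmd += ["--config", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(" ".join(create_topic_command(
        "events", "localhost:9092", 12, 3, MASTER_DATASET_TOPIC_CONFIG)))
```

Appends and parallel reads come from the partitioned, append-only log itself; the settings above only cover retention, immutability of history, and storage cost.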

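Kafka can satisfy all of these, provided topic retention is configured so that records never expire (by default, brokers delete log segments after seven days), so technically it could indeed store the master dataset in your Batch Layer.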
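
For the RDD part of the question, one possible approach is Spark's Kafka source in batch mode, reading from the earliest to the latest offset. This is a minimal sketch, assuming an existing `SparkSession` and the `spark-sql-kafka-0-10` connector on the classpath; `kafka_batch_options`, `load_history`, and the broker and topic names in the usage note are hypothetical.

```python
def kafka_batch_options(bootstrap_servers, topic):
    """Options for a bounded (batch) read that covers the whole topic."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": topic,
        "startingOffsets": "earliest",  # start of the retained log
        "endingOffsets": "latest",      # stop at the offsets current when the job starts
    }


def load_history(spark, bootstrap_servers, topic):
    """Read every retained record of `topic` as a DataFrame.

    `spark` is an existing SparkSession created by the caller.
    """
    reader = spark.read.format("kafka")
    for key, value in kafka_batch_options(bootstrap_servers, topic).items():
        reader = reader.option(key, value)
    df = reader.load()
    # Kafka keys and values arrive as binary; casting to string is an
    # assumption about how the records were serialized.
    return df.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "timestamp",
    )
```

Calling `load_history(spark, "localhost:9092", "events").rdd` would then give an RDD of rows to build the batch view from.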

However, the Kappa Architecture, devised by Jay Kreps (formerly of LinkedIn), is a lot easier to work with than the Lambda Architecture, and I would say more effective at satisfying modern use cases like IoT. All you need to make it happen is distributed, scalable, immutable, configurable streaming, which is exactly what Kafka provides. So why not just do that?
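
To make the Kappa idea concrete, here is a toy sketch of reprocessing by replay: when the logic changes, a new version of the job reads the same log from the beginning and produces a new view, with no separate batch code path. The in-memory list stands in for a Kafka topic, and `replay`, `count_events`, `max_reading`, and the sensor records are all invented for illustration.

```python
def replay(log, update):
    """Fold an append-only log into a view, starting from the first record."""
    view = {}
    for key, value in log:
        update(view, key, value)
    return view


def count_events(view, key, value):
    # Version 1 of the job: number of events seen per key.
    view[key] = view.get(key, 0) + 1


def max_reading(view, key, value):
    # Version 2 of the job: highest reading seen per key.
    view[key] = max(view.get(key, value), value)


# Stand-in for a Kafka topic: an ordered, append-only list of (key, value) records.
log = [("sensor-a", 21.5), ("sensor-b", 19.0), ("sensor-a", 22.1)]

view_v1 = replay(log, count_events)  # {"sensor-a": 2, "sensor-b": 1}
view_v2 = replay(log, max_reading)   # {"sensor-a": 22.1, "sensor-b": 19.0}
```

In a real deployment the second version would run as a new consumer starting from the earliest offset and write to a new output table, and readers would switch over once it has caught up.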

To use Kafka for data storage in the Batch Layer of the Lambda Architecture is to underutilize its capability--and for the sole purpose of forcing it into an architecture that is actually less effective over time.
