Reputation: 2859
Kafka, as a storage system, can serve as a data store for long-term data. It can replicate and distribute data without problems. So can I create an RDD from all of the historical data in Kafka, build a batch view, and then combine it with Spark Streaming views?
Upvotes: 0
Views: 496
Reputation: 30310
tl;dr Yes, but why?
According to Nathan Marz, formerly of Twitter and author of the Lambda Architecture, these are the storage requirements for the master dataset in the Batch Layer:

- Efficient appends of new data
- Scalable storage
- Support for parallel processing
- Enforceable immutability
- Tunable storage and processing costs
Kafka satisfies all of these, so technically it could indeed store the master dataset in your Batch Layer.
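If you do go that route, building the batch view from Kafka's retained history with Spark could look roughly like the sketch below. It uses the spark-sql-kafka batch source; the broker address, topic name, and output path are placeholders, and the simple count aggregation just stands in for whatever your real batch computation is:

```scala
import org.apache.spark.sql.SparkSession

object KafkaBatchView {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaBatchView")
      .getOrCreate()
    import spark.implicits._

    // Batch read of everything currently retained in the topic,
    // from the earliest to the latest available offsets.
    val historical = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "events")                     // placeholder topic
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // Kafka delivers key/value as binary; cast and aggregate into a batch view.
    val batchView = historical
      .selectExpr("CAST(value AS STRING) AS value")
      .groupBy($"value")
      .count()

    // Persist the precomputed view for the serving layer to pick up.
    batchView.write.mode("overwrite").parquet("/views/batch_view") // placeholder path

    spark.stop()
  }
}
```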
However, the Kappa Architecture, devised by Jay Kreps (formerly of LinkedIn), is much easier to work with than the Lambda Architecture, and I would say more effective for modern use cases such as IoT. All you need to make it happen is a distributed, scalable, immutable, configurable stream, which is exactly what Kafka provides. So why not just do that?
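For comparison, a Kappa-style setup is a single streaming job, and a "recompute" is just redeploying that job with a fresh checkpoint so it replays the full retained log. A sketch under the same assumptions (placeholder broker, topic, and checkpoint path; console sink only for illustration):

```scala
import org.apache.spark.sql.SparkSession

object KappaReplay {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KappaReplay")
      .getOrCreate()
    import spark.implicits._

    // One streaming job reads the topic from the earliest offsets,
    // so a redeploy with a new checkpoint replays all retained history.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder broker
      .option("subscribe", "events")                     // placeholder topic
      .option("startingOffsets", "earliest")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // The same aggregation as before, maintained continuously instead of in batch.
    val counts = events.groupBy($"value").count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .option("checkpointLocation", "/checkpoints/kappa-replay") // placeholder path
      .start()
      .awaitTermination()
  }
}
```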
Using Kafka merely as the data store for the Batch Layer of a Lambda Architecture underutilizes its capabilities, and does so solely to force it into an architecture that tends to be less effective over time.
Upvotes: 1