Salva

Reputation: 113

Big Data Batch and Stream Data pipeline with Hadoop Spark

I am designing the flow below and want to know whether I am going about it the right way. I want to drop any unnecessary steps I may have added. I have Hadoop running with Spark as the execution engine.

[Diagram: proposed batch and stream data pipeline]

Upvotes: 0

Views: 205

Answers (1)

OneCricketeer

Reputation: 191963

Use Debezium to pull changes from the RDBMS. All writes then land in Kafka, and you don't end up with "batches" at all. (Sqoop is a retired Apache project.)
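For example, here is a minimal sketch of a Debezium MySQL connector registration, posted to the Kafka Connect REST API (`POST /connectors`). The hostnames, credentials, and table names are placeholders, and exact property names depend on your Debezium version (2.x uses `topic.prefix` and `schema.history.internal.*` where 1.x used `database.server.name` and `database.history.*`):

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "dbserver1",
    "table.include.list": "inventory.orders",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
```

With this, every committed row change arrives as an event on a topic like `dbserver1.inventory.orders`. Note that Debezium emits a change-event envelope (`before`/`after` fields), so if downstream consumers expect flat rows you'd typically add the `io.debezium.transforms.ExtractNewRecordState` SMT to the connector config.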

Use Apache Pinot or Druid to ingest from Kafka directly. Then you don't need HDFS.
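As a rough sketch, a Pinot realtime table can consume that Kafka topic via its stream configs. The topic, broker list, and column names below are placeholders that match the connector sketch above, and the exact location of `streamConfigs` in the table config varies across Pinot versions:

```json
{
  "tableName": "orders",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "orders",
    "replication": "1"
  },
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "dbserver1.inventory.orders",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder"
    }
  }
}
```

Pinot builds and manages its own segments from the stream, which is what removes the need for an HDFS landing zone.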

You can query Pinot / Druid using SQL. Or you can use Presto in place of Hive/SparkSQL, and you should be able to connect Superset to Presto rather than to an intermediate RDBMS.
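For illustration, assuming Presto/Trino has its Pinot connector configured under a catalog named `pinot` (the catalog, schema, and column names here are placeholders), Superset could run a query like:

```sql
-- Aggregate the last day of orders straight from Pinot,
-- with no HDFS staging or intermediate RDBMS involved
SELECT customer_id,
       count(*)   AS order_count,
       sum(total) AS revenue
FROM pinot.default.orders
WHERE ts >= current_timestamp - INTERVAL '1' DAY
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 20;
```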

Upvotes: 1
