Salva

Reputation: 113

Big Data Batch and Stream Data pipeline with Hadoop Spark

I am designing the flow below and want to know whether I am going about it the right way. I want to drop any unnecessary steps I may have added. I have Hadoop running with Spark as the execution engine.

[Diagram: proposed batch and stream data pipeline]

Upvotes: 0

Views: 205

Answers (1)

OneCricketeer

Reputation: 191963

Use Debezium to pull changes from the RDBMS. All writes then land in Kafka, and you don't end up with "batches" at all. (Sqoop is a retired Apache project.)
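For example, here is a minimal sketch of a Debezium MySQL connector registration, posted to the Kafka Connect REST API (`POST /connectors`). The hostnames, credentials, and table names are placeholders, and exact property names depend on your Debezium version (2.x uses `topic.prefix` and `schema.history.internal.*` where 1.x used `database.server.name` and `database.history.*`):

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "topic.prefix": "dbserver1",
    "table.include.list": "inventory.orders",
    "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}
```

With this, every committed row change arrives as an event on a topic like `dbserver1.inventory.orders`. Note that Debezium emits a change-event envelope (`before`/`after` fields), so if downstream consumers expect flat rows you'd typically add the `io.debezium.transforms.ExtractNewRecordState` SMT to the connector config.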

Use Apache Pinot or Druid to ingest from Kafka directly. Then you don't need HDFS.
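As a rough sketch, a Pinot realtime table can consume that Kafka topic via its stream configs. The topic, broker list, and column names below are placeholders that match the connector sketch above, and the exact location of `streamConfigs` in the table config varies across Pinot versions:

```json
{
  "tableName": "orders",
  "tableType": "REALTIME",
  "segmentsConfig": {
    "timeColumnName": "ts",
    "schemaName": "orders",
    "replication": "1"
  },
  "tableIndexConfig": {
    "streamConfigs": {
      "streamType": "kafka",
      "stream.kafka.topic.name": "dbserver1.inventory.orders",
      "stream.kafka.broker.list": "kafka:9092",
      "stream.kafka.consumer.type": "lowlevel",
      "stream.kafka.consumer.factory.class.name": "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
      "stream.kafka.decoder.class.name": "org.apache.pinot.plugin.inputformat.json.JSONMessageDecoder"
    }
  }
}
```

Pinot builds and manages its own segments from the stream, which is what removes the need for an HDFS landing zone.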

You can query Pinot / Druid using SQL. Or you can use Presto in place of Hive/SparkSQL, and you should be able to connect Superset to Presto rather than to an intermediate RDBMS.
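For illustration, assuming Presto/Trino has its Pinot connector configured under a catalog named `pinot` (the catalog, schema, and column names here are placeholders), Superset could run a query like:

```sql
-- Aggregate the last day of orders straight from Pinot,
-- with no HDFS staging or intermediate RDBMS involved
SELECT customer_id,
       count(*)   AS order_count,
       sum(total) AS revenue
FROM pinot.default.orders
WHERE ts >= current_timestamp - INTERVAL '1' DAY
GROUP BY customer_id
ORDER BY revenue DESC
LIMIT 20;
```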

Upvotes: 1
