smeeb

Reputation: 29537

How does Hadoop get input data not stored on HDFS?

I'm trying to wrap my brain around Hadoop; I've read this excellent tutorial and perused the official Hadoop docs. However, in none of this literature can I find a simple explanation for something pretty rudimentary:

In all the contrived "Hello World!" (word count) introductory MR examples, the input data is stored directly in text files. However, to me, it feels like this would seldom be the case out in the real world. I would imagine that in reality, the input data would live in large data stores like a relational DB, Mongo, or Cassandra, or only be available via a REST API, etc.

So I ask: In the real world, how does Hadoop get its input data? I do see that there are projects like Sqoop and Flume and am wondering if the whole point of these frameworks is to simply ETL input data onto HDFS for running MR jobs.

Upvotes: 5

Views: 576

Answers (1)

user4097444

Reputation:

Actually, HDFS is still needed in real-world applications, for several reasons:

  • Very high bandwidth to support MapReduce workloads, plus scalability.
  • Data reliability and fault tolerance, thanks to replication and its distributed nature; this is required for critical data systems.
  • Flexibility: you don't have to pre-process data before storing it in HDFS.

Hadoop is designed around a write-once, read-many model. Kafka, Flume, and Sqoop, which are generally used for ingestion, are themselves very fault tolerant and provide high bandwidth for moving data into HDFS. Sometimes you need to ingest gigabytes of data per minute from thousands of sources; for that you need these ingestion tools as well as a fault-tolerant storage system, i.e. HDFS.
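For concreteness, here is a minimal sketch of that ingestion path, invoking Sqoop's Java entry point to copy a relational table into HDFS so MR jobs can then read it from there. The JDBC URL, table name, credentials, and HDFS paths are hypothetical placeholders, and exact flags can vary between Sqoop versions.

    import org.apache.sqoop.Sqoop;

    public class OrdersImport {
        public static void main(String[] args) {
            // Hypothetical example: pull the "orders" table from a MySQL database
            // into an HDFS landing directory before running MR jobs on it.
            String[] sqoopArgs = new String[] {
                "import",
                "--connect", "jdbc:mysql://dbhost:3306/mydb",  // placeholder JDBC URL
                "--username", "etl_user",                      // placeholder credentials
                "--password-file", "/user/etl/.db_password",   // keeps the password off the command line
                "--table", "orders",                           // placeholder source table
                "--target-dir", "/data/raw/orders",            // HDFS directory the data lands in
                "--num-mappers", "4"                           // copy in parallel with 4 map tasks
            };
            // Sqoop parses these arguments and runs the import itself as a MapReduce job.
            int exitCode = Sqoop.runTool(sqoopArgs);
            System.exit(exitCode);
        }
    }

Once the import finishes, a word-count-style MR job simply points its input path at /data/raw/orders on HDFS.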

Upvotes: 5
