userRandom

Reputation: 160

Where is data stored in Hadoop?

Though I have understood the Hadoop architecture a bit, I still have a gap in my understanding of where the data is actually located.

My question is: suppose I have a large amount of data from some random books. Is the data for the books stored across multiple nodes in HDFS beforehand, and then we perform MapReduce on each node and get the result back on our system?

'OR'

Do we store the data somewhere in a large database and, whenever we want to perform a MapReduce operation, take chunks of it and copy them to multiple nodes for processing?

Upvotes: 1

Views: 3317

Answers (1)

AaronM

Reputation: 2459

Either is possible; it really depends on your use case and needs. However, Hadoop MapReduce generally runs against data stored in HDFS. The system is designed around data locality, which requires the data to be in HDFS: the map tasks run on the same piece of hardware where the data is stored in order to improve performance.

That said, if for some reason your data must be stored outside of HDFS and then processed using MapReduce, it can be done, but it is a bit more work and is not as efficient as processing data locally in HDFS.

So let's take two use cases. Start with log files. Log files as they are are not particularly accessible; they just need to be stuck somewhere and stored for later analysis. HDFS is perfect for this. If you really need a log back out you can get it, but generally people will be looking for the output of the analytics. So store your logs in HDFS and process them normally.
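
For illustration only, a minimal job along these lines might look like the sketch below. It reads log files straight out of HDFS and counts lines per severity level; the paths, class names and the "severity is the first token" assumption are all made up.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: count log lines per severity level, input and output in HDFS.
public class LogSeverityCount {

  public static class SeverityMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text severity = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] tokens = value.toString().split("\\s+");
      if (tokens.length > 0 && !tokens[0].isEmpty()) {
        severity.set(tokens[0]);        // e.g. INFO, WARN, ERROR (assumed log layout)
        context.write(severity, ONE);
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "log severity count");
    job.setJarByClass(LogSeverityCount.class);
    job.setMapperClass(SeverityMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input and output live in HDFS, so map tasks can run where the blocks are stored.
    FileInputFormat.addInputPath(job, new Path("/logs/raw"));
    FileOutputFormat.setOutputPath(job, new Path("/logs/severity-counts"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```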

However, data in the format that is ideal for HDFS and Hadoop MapReduce (many records in a single large flat file) is not what I would consider highly accessible. Hadoop MapReduce expects input files that are many megabytes in size with many records per file; the more you diverge from this case, the more your performance will decline. Sometimes your data needs to be online at all times, and HDFS is not ideal for this. For instance, take your book example: if these books are used in an application that needs the content accessible in an online fashion, i.e. for editing and annotating, you may choose to store them in a database. Then, when you need to run batch analytics, you use a custom InputFormat to retrieve the records from the database and process them in MapReduce.
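
If you don't want to write a completely custom InputFormat, Hadoop ships a DBInputFormat that covers the common JDBC case. Below is a rough sketch of that approach; the JDBC driver, connection URL, table name and column names are all hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Each database row becomes one BookRecord handed to a mapper as the value
// (the key is a LongWritable row id supplied by DBInputFormat).
public class BookRecord implements Writable, DBWritable {
  long id;
  String title;
  String body;

  // How a database row is read into this record.
  public void readFields(ResultSet rs) throws SQLException {
    id = rs.getLong("id");
    title = rs.getString("title");
    body = rs.getString("body");
  }

  // How this record would be written back to the database (unused for input-only jobs).
  public void write(PreparedStatement ps) throws SQLException {
    ps.setLong(1, id);
    ps.setString(2, title);
    ps.setString(3, body);
  }

  // Hadoop serialization, used when records move between tasks.
  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    title = in.readUTF();
    body = in.readUTF();
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(id);
    out.writeUTF(title);
    out.writeUTF(body);
  }

  // Driver setup: mappers receive (LongWritable, BookRecord) pairs pulled over JDBC.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/library", "user", "password");   // placeholder connection
    Job job = Job.getInstance(conf, "book analytics");
    job.setJarByClass(BookRecord.class);
    job.setInputFormatClass(DBInputFormat.class);
    DBInputFormat.setInput(job, BookRecord.class,
        "books",                 // table (hypothetical)
        null,                    // WHERE conditions
        "id",                    // ORDER BY column, used to split the input
        "id", "title", "body");  // columns to select
    // ... set mapper/reducer classes and an output path as usual, then submit.
  }
}
```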

I am currently doing this with a web crawler that stores the web pages individually in Amazon S3. Individual web pages are too small to serve as efficient MapReduce inputs on their own, so I have a custom InputFormat that feeds each mapper several files. The output of this MapReduce job is eventually written back to S3, and because I am using Amazon EMR, the Hadoop cluster goes away.
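
My InputFormat is specific to my crawler, but for reference, newer Hadoop versions ship a built-in CombineTextInputFormat that does the same kind of small-file bundling. A rough sketch of that approach is below; the bucket name, paths, URI scheme and split size are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver: pack many small page files into each input split so that
// one mapper processes several pages instead of one mapper per file.
public class CrawlAnalysisDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "crawl analysis");
    job.setJarByClass(CrawlAnalysisDriver.class);

    // Bundle small files together rather than creating one split per file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    // Cap each combined split at roughly 128 MB.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);

    // Read from and write back to S3 (scheme and bucket are placeholders).
    FileInputFormat.addInputPath(job, new Path("s3n://my-crawl-bucket/pages/"));
    FileOutputFormat.setOutputPath(job, new Path("s3n://my-crawl-bucket/analysis/"));

    // ... set mapper/reducer classes and key/value types, then submit.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```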

Upvotes: 3
