Reputation: 17
I am trying to understand how MapReduce works in general. What I know is that there are mappers that run in parallel over several machines and create an intermediate result set, which is then used by reducers, also running in parallel over several machines, to produce the intended data set.
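For reference, here is a minimal sketch of the two roles as I understand them, modeled on the standard Hadoop word-count example (the class names and types are the stock org.apache.hadoop.mapreduce ones; this is just an illustration, not my actual job):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: many instances run in parallel, one per input split,
// each emitting intermediate (word, 1) pairs.
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reducer: after the shuffle, each reducer receives all values for its keys
// and aggregates them into the final output.
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}
```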
My questions are:
Does one job run on a fixed set of files? That is, at the start of a job, is there a fixed set of input files that must be processed to produce the output data?
If not, then how can we process a stream of data that may be coming from different sources, e.g. Twitter feeds?
If yes, please explain how MapReduce finds out when all the mappers have finished and the reduce phase should begin, because there seems to be no point of reference.
Upvotes: 0
Views: 284
Reputation: 5063
Answers:
Yes. A job starts, processes a fixed set of input files, and ends; it does not run forever.
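A minimal driver sketch, assuming the mapper/reducer classes from the question: the input paths are fixed at the moment the job is submitted, and `waitForCompletion` returns once the whole job has finished.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(TokenizerMapper.class);   // mapper class from the question
    job.setReducerClass(IntSumReducer.class);    // reducer class from the question
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // The input is a fixed set of files/directories known when the job is submitted.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Blocks until the job has finished; a MapReduce job always terminates.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```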
Stream processing can be handled by Storm or similar technologies, but not by Hadoop alone, since it is a batch-processing system. You can also look at how Hadoop YARN and Storm can work together.
There is a point of reference: the TaskTrackers running on different nodes periodically send status information about the tasks they are running (map tasks / reduce tasks) to the JobTracker, which coordinates the job run, so the framework knows when all map tasks have completed and reducing can begin.
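That aggregated status is also visible from the client side. A rough sketch using the standard `Job` API (`mapProgress`/`reduceProgress` report the fraction of map and reduce work completed):

```java
import org.apache.hadoop.mapreduce.Job;

public class JobMonitor {
  // Polls a configured job until it completes, printing the aggregated
  // map/reduce progress that the framework collects from task status reports.
  public static void runAndMonitor(Job job) throws Exception {
    job.submit();                       // asynchronous submission
    while (!job.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);               // poll every 5 seconds
    }
    System.out.println("Job " + (job.isSuccessful() ? "succeeded" : "failed"));
  }
}
```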
Upvotes: 1