Sara

Reputation: 2436

map task input data

I am new to map/reduce. Is it possible for the input of one map task to be on different servers? Assume I want to simulate "word count" using map/reduce and I split the data line by line (each line is one piece). Is it true that each map task will take one piece of data and count the number of occurrences of each word in that piece?

Upvotes: 0

Views: 82

Answers (2)

abhinav

Reputation: 1282

The input is divided into input splits. The splits are computed by the job's InputFormat, each one is represented by an InputSplit, and you can supply your own InputFormat to control how the data is split. The number of map tasks is equal to the number of input splits. So, in theory, if you write the splitting logic so that every line becomes its own split, each line is fed as the input to a separate map task. In general, the input of a map task is located on the same machine that runs it: the MapReduce framework tries to schedule each map task on a node that holds its data, and reads the data over the network only when that is not possible. I suggest you read up on the basics of MapReduce. Good video tutorials are available on the Cloudera website.
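To make the one-line-per-map-task idea concrete, here is a minimal plain-Python simulation. It does not use the Hadoop API; the function names (`make_splits`, `map_task`, `reduce_counts`, `word_count`) are hypothetical and chosen only for illustration. It assumes each non-empty line is its own split, each map task emits a (word, 1) pair per word in its split, and a reduce step sums the pairs per word.

```python
from collections import Counter


def make_splits(text):
    """Hypothetical splitter: one split per non-empty input line."""
    return [line for line in text.splitlines() if line.strip()]


def map_task(split):
    """One map task: emit a (word, 1) pair for every word in its split."""
    return [(word.lower(), 1) for word in split.split()]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each word by all map tasks."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(text):
    """Run one simulated map task per split, then reduce the combined output."""
    splits = make_splits(text)
    mapped = [map_task(split) for split in splits]
    all_pairs = [pair for pairs in mapped for pair in pairs]
    return len(splits), reduce_counts(all_pairs)


if __name__ == "__main__":
    num_map_tasks, counts = word_count("the cat sat\nthe dog sat\nthe end")
    print(num_map_tasks, counts)
```

Each simulated map task sees only its own line, so the per-word totals for the whole input only appear after the reduce step has combined the output of all map tasks.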

Upvotes: 1

RGC

Reputation: 382

The input file is split based on the HDFS block size, and exactly one map task is spawned for each of these splits.

For example, take an HDFS block size of 64 MB (the default in older Hadoop releases). Let's say your input file is 100 MB. When you load this file into HDFS, it is stored as 2 blocks (64 MB and 36 MB), which gives 2 input splits. Hence 2 map tasks are spawned, one per input split. (A 50 MB file would fit in a single block, so it would get only one split and one map task.) Let's assume that one input split has 100 lines; then the mapper class (task) calls the map method 100 times, once for each line.
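The arithmetic behind block-based splitting can be sketched in a few lines of plain Python. This is a simplification, not Hadoop code: it assumes exactly one split per HDFS block and ignores any minimum or maximum split-size settings. The names `split_sizes` and `run_map_task` are hypothetical.

```python
MB = 1024 * 1024


def split_sizes(file_size, block_size=64 * MB):
    """Sizes of the splits for a file, assuming one split per HDFS block."""
    sizes = []
    remaining = file_size
    while remaining > 0:
        size = min(block_size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


def run_map_task(lines, map_fn):
    """Call map_fn once per line of a split and return the number of calls.

    The line index is used here as a stand-in for the record key.
    """
    calls = 0
    for index, line in enumerate(lines):
        map_fn(index, line)
        calls += 1
    return calls


if __name__ == "__main__":
    sizes = split_sizes(200 * MB)
    print(len(sizes), [s // MB for s in sizes])
    print(run_map_task(["some line"] * 100, lambda key, value: None))
```

Under these assumptions, a 200 MB file with 64 MB blocks yields four splits (64, 64, 64 and 8 MB), so four map tasks, and a split of 100 lines leads to 100 calls of the map function.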

Upvotes: 1
