Confusion about how Hadoop splits work

Question

We are Hadoop newbies. We realize that Hadoop is meant for processing big data and that a Cartesian product is extremely expensive. However, we are running some experiments with a Cartesian product job similar to the one in the MapReduce Design Patterns book, except with a reducer that calculates the average of all intermediate results (including only the upper half of A*B, so the total is A*B/2). Our setup: 3-node cluster, block size = 64 MB. We tested different data set sizes ranging from 5000 points (130 KB) to 10000 points (260 KB).
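To make the computation concrete, here is a minimal plain-Java sketch (no Hadoop involved) of what the job is meant to produce. It assumes both inputs are the same point set, so "upper half" means pairs (i, j) with i < j. The `pairValue` function is a hypothetical stand-in for the real per-pair computation, and the class and method names are made up for illustration.

```java
public class UpperHalfAverage {

    // Hypothetical per-pair computation; the real job may compute something else.
    static double pairValue(double a, double b) {
        return Math.abs(a - b);
    }

    // Number of pairs in the upper half (i < j) of an n x n cross product.
    static long pairCount(int n) {
        return (long) n * (n - 1) / 2;
    }

    // Average of pairValue over all pairs (i, j) with i < j.
    static double average(double[] points) {
        double sum = 0.0;
        long count = 0;
        for (int i = 0; i < points.length; i++) {
            for (int j = i + 1; j < points.length; j++) {
                sum += pairValue(points[i], points[j]);
                count++;
            }
        }
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 4.0};
        System.out.println(average(points));   // 2.0
        System.out.println(pairCount(5000));   // 12497500
        System.out.println(pairCount(10000));  // 49995000
    }
}
```

So even our small inputs produce roughly 12.5 to 50 million intermediate pairs, which is why we expected the work to be spread over several machines.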

Observations:

1- All map tasks run on one node, sometimes on the master machine and other times on one of the slaves, but they are never processed on more than one machine. Is there a way to force Hadoop to distribute the splits, and therefore the map tasks, among machines? Based on what factors does Hadoop decide which machine is going to process the map tasks (in one run it chose the master, in another it chose a slave)?

2- In all cases where we test the same job on different data sizes, we get 4 map tasks. Where does the number 4 come from? Since our data size is less than the block size, why do we get 4 splits and not 1?

3- Is there a way to see more information about the exact splits for a running job?
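For observation 2, this is roughly the reasoning behind our expectation of a single split, written as a standalone Java sketch. It mirrors our understanding of how the stock `FileInputFormat` (new `mapreduce` API) sizes splits, namely `max(minSize, min(maxSize, blockSize))` with a 10% slop on the last split. The default min/max values and the class name `SplitEstimate` are assumptions on our part, and a custom input format such as the one in the book's Cartesian product example may compute splits quite differently.

```java
public class SplitEstimate {

    // Slack factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Split size rule as we understand it from FileInputFormat.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Expected number of splits for one non-empty, splittable file.
    static int expectedSplits(long fileSize, long blockSize, long minSize, long maxSize) {
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);
        long remaining = fileSize;
        int splits = 0;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            remaining -= splitSize;
            splits++;
        }
        if (remaining != 0) {
            splits++; // the final, possibly smaller, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB, as in our cluster
        long minSize = 1L;                  // assumed default minimum
        long maxSize = Long.MAX_VALUE;      // assumed default maximum
        System.out.println(expectedSplits(130L * 1024, blockSize, minSize, maxSize)); // 1
        System.out.println(expectedSplits(260L * 1024, blockSize, minSize, maxSize)); // 1
    }
}
```

Under these assumptions both of our files should yield one split each, which is why the 4 map tasks surprise us.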

Thanks in advance

