Reputation: 502
OS: CentOS 7.2, CDH: 5.8.0, Hosts: 11 (2 masters, 4 DN+NM, 5 NM-only)
yarn.nodemanager.resource.memory-mb: 32074 MB (NodeManager group1), 82384 MB (NodeManager group2)
I have an 11-node Hadoop cluster: 2 masters, 4 slaves running both the DataNode and NodeManager daemons, and 5 nodes running only the NodeManager daemon. On this cluster I am running the TestDFSIO benchmark with an 8 TB load: 10000 files of 800 MB each. I have noticed a few things that I cannot properly explain.
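For reference, the write run was kicked off with the standard jobclient tests jar, along these lines (the exact jar path depends on the CDH parcel layout):

    hadoop jar /opt/cloudera/parcels/CDH/jars/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 10000 -fileSize 800MB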
1) The number of splits for this job is shown as 10000. How can it be only 10000 splits? My dfs.blocksize is 128 MB, so each 800 MB file spans about 7 blocks, and going by that I would expect roughly 70000 splits, not 10000, right?
2) In the ResourceManager web UI I saw that only 32 map tasks ran on each of my 5 compute-only nodes (the nodes running just a NodeManager). All the other map tasks ran on the 4 DN+NM nodes. Why is this happening? I have split my 9 slave nodes into two role groups: the 4 DN+NM nodes are in NodeManager group1 and the other 5 slaves are in NodeManager group2. yarn.nodemanager.resource.memory-mb is 32074 MB for the slaves in group1 and 82384 MB for the slaves in group2, so ideally the 5 slaves in group2 should take more map tasks. Why is this not happening?
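(yarn.nodemanager.resource.memory-mb is set per role group in Cloudera Manager, so the generated yarn-site.xml on a group2 host ends up with something like:)

    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>82384</value>
    </property>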
Upvotes: 0
Views: 198
Reputation: 3688
AFAIR, TestDFSIO allocates one map task per file. That is why you end up with exactly as many map tasks as files, even though your block size is smaller than the file size.
How is your data locality configured? Mappers prefer nodes where the data is local, which would explain why you get more tasks on the nodes that also have a DataNode running.
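You can check the locality explanation in the job counters once the run finishes; the placement of the map tasks is recorded there. A quick way to pull them (the job id below is a placeholder for your run):

    mapred job -status job_xxxxxxxxxxxxx_xxxx

In the counter listing (or on the job's Counters page in the JobHistory web UI), compare "Data-local map tasks", "Rack-local map tasks" and "Other local map tasks". A high data-local count means the scheduler did indeed favour the hosts that hold the blocks, i.e. your 4 DN+NM nodes.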
Upvotes: 2