Reputation: 208
For each data node, is it better to have four HDDs of 500 GB each or a single 2 TB HDD? In other words, does a data node write to its disks in parallel or not?
Upvotes: 2
Views: 1206
Reputation: 5063
If a datanode has 4 disks mounted as /disk1, /disk2, /disk3 and /disk4, it usually writes blocks to them in round-robin fashion. Having multiple disks is usually the better approach: when Hadoop reads distinct blocks from separate disks concurrently, it is not limited by the I/O capability of a single disk.
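As a minimal sketch (assuming Hadoop 2.x property names; the /diskN/dfs/dn subdirectories are illustrative), the datanode's disks are listed in hdfs-site.xml, and round-robin is the default volume-choosing policy, which can also be set explicitly:

<!-- hdfs-site.xml (sketch): point the datanode at all four mounts -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/disk1/dfs/dn,/disk2/dfs/dn,/disk3/dfs/dn,/disk4/dfs/dn</value>
</property>
<!-- Round-robin is the default policy for choosing which disk gets the next block -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy</value>
</property>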
Upvotes: 1
Reputation: 33555
Leaving cooling, power and other aspects aside: multiple HDDs provide better read/write throughput than a single HDD of the same total capacity, which matters all the more since we are talking about Big Data. Multiple HDDs also provide better fault tolerance than a single larger HDD, since losing one disk takes out only a fraction of the node's storage.
Check this blog about general hardware recommendations.
Upvotes: 2
Reputation: 39943
A datanode does not read/write a single block in parallel across its disks. However, it does read/write several blocks in parallel. That is, if you are just writing one file, you won't see any difference... but if you are running a MapReduce job with several tasks per node (typical), you will benefit from the additional throughput.
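A minimal sketch of that "several tasks per node" setup, assuming the classic MRv1 TaskTracker properties (the slot counts are illustrative): the concurrent tasks configured below are what end up touching blocks on different disks at the same time.

<!-- mapred-site.xml (sketch): allow several map/reduce tasks per node -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>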
There are other considerations than 500 GB vs. 2 TB: physical space in the nodes, cost, heat/cooling, etc. For example, if you fill a box with four times as many drives, do your nodes need to be 2U instead of the 1U they could be with 2 TB drives? But if we are just talking about performance, I'd take 4x 500 GB over 1x 2 TB any day.
Upvotes: 2