Amin Raeiszadeh

Reputation: 208

How does Hadoop write to the hard drives of each data node?

I want to know whether it is better for each data node to have four 500GB hard drives or a single 2TB drive. In other words, does a data node write to its hard drives in parallel or not?

Upvotes: 2

Views: 1206

Answers (3)

SSaikia_JtheRocker

Reputation: 5063

If you have 4 disks mounted as /disk1, /disk2, /disk3 and /disk4 for a datanode, it usually uses round robin to write blocks to those disks. Having multiple disks is usually the better approach, since Hadoop can then read distinct blocks from separate disks concurrently and isn't limited by the I/O capability of a single disk.
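
For reference, a minimal sketch of how such mounts might be listed in hdfs-site.xml (the property is dfs.data.dir in Hadoop 1.x and dfs.datanode.data.dir in 2.x+; the /disk1../disk4 paths are just placeholders for your actual mount points):

    <property>
      <!-- Hadoop 1.x: dfs.data.dir; Hadoop 2.x+: dfs.datanode.data.dir -->
      <name>dfs.datanode.data.dir</name>
      <!-- Comma-separated list of directories; the datanode spreads new
           blocks across these volumes in round-robin fashion -->
      <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data,/disk4/hdfs/data</value>
    </property>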

Upvotes: 1

Praveen Sripati

Reputation: 33555

Keeping cooling/power and other aspects out of consideration, multiple HDDs provide better R/W throughput than a single HDD of the same capacity. Since we are talking about Big Data, this matters a great deal. Multiple HDDs also provide better fault tolerance than a single larger HDD.
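
As a rough back-of-envelope illustration (assuming roughly 100 MB/s of sequential throughput per 7200 rpm spindle, which is an assumption rather than a measured figure): four 500GB drives can stream about 4 × 100 ≈ 400 MB/s in aggregate, while a single 2TB drive tops out at roughly 100-150 MB/s.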

Check this blog about the general h/w recommendations.

Upvotes: 2

Donald Miner

Reputation: 39943

Hadoop does not read/write a single block in parallel across the disks. However, it does read/write several blocks in parallel. That is, if you are just writing one file, you won't see any difference... but if you are running a MapReduce job with several tasks per node (typical), you will benefit from the additional throughput.

There are other considerations than 500GB vs. 2TB: physical space in the nodes, cost, heat/cooling, etc. For example, if you fill a box with four times as many drives, do your nodes need to be 2U instead of the 1U a single 2TB drive would allow? But if you are just talking about performance, I'd take 4x 500GB over 1x 2TB any day.
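
As a hedged illustration of "several tasks per node" (Hadoop 1.x-style property name; the value of 4 is only an example and should be tuned to your hardware):

    <property>
      <!-- Number of map tasks a TaskTracker runs concurrently; more concurrent
           tasks means more blocks being read/written at once, which is where
           multiple spindles pay off -->
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>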

Upvotes: 2
