Reputation: 624
Environment
Hadoop : 0.20.205.0
Number of machines in cluster : 2 nodes
Replication : set to 1
DFS Block size : 1MB
I put a 7.4MB file into HDFS with the put command and ran fsck to check how the file's blocks are distributed among the datanodes. All 8 blocks of the file went to a single node. This skews the load distribution, and only that one node gets used while running MapReduce tasks.
Is there a way to spread the file's blocks across more than one datanode?
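For context, the 1MB block size and single replica listed above can also be set per command; this is only a sketch of an equivalent put (assuming the settings are not already in hdfs-site.xml, and using the 0.20-era property name dfs.block.size):
# Hypothetical per-command equivalent of the environment above:
# 1 MB blocks and a single replica, applied only to this put.
bin/hadoop fs -D dfs.block.size=1048576 -D dfs.replication=1 -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3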
bin/hadoop dfsadmin -report
Configured Capacity: 4621738717184 (4.2 TB)
Present Capacity: 2008281120783 (1.83 TB)
DFS Remaining: 2008281063424 (1.83 TB)
DFS Used: 57359 (56.01 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (6 total, 4 dead)
Name: 143.215.131.246:50010
Decommission Status : Normal
Configured Capacity: 2953506713600 (2.69 TB)
DFS Used: 28687 (28.01 KB)
Non DFS Used: 1022723801073 (952.49 GB)
DFS Remaining: 1930782883840(1.76 TB)
DFS Used%: 0%
DFS Remaining%: 65.37%
Last contact: Fri Jul 18 10:31:51 EDT 2014
bin/hadoop fs -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3
bin/hadoop fs -ls /user/rkannan3
Found 1 items
-rw------- 1 rkannan3 supergroup 7420270 2014-07-18 10:40 /user/rkannan3/pg20417.txt
bin/hadoop fsck /user/rkannan3 -files -blocks -locations
FSCK started by rkannan3 from /143.215.131.246 for path /user/rkannan3 at Fri Jul 18 10:43:13 EDT 2014
/user/rkannan3 <dir>
/user/rkannan3/pg20417.txt 7420270 bytes, 8 block(s): OK <==== All the 8 blocks in one DN
0. blk_3659272467883498791_1006 len=1048576 repl=1 [143.215.131.246:50010]
1. blk_-5158259524162513462_1006 len=1048576 repl=1 [143.215.131.246:50010]
2. blk_8006160220823587653_1006 len=1048576 repl=1 [143.215.131.246:50010]
3. blk_4541732328753786064_1006 len=1048576 repl=1 [143.215.131.246:50010]
4. blk_-3236307221351862057_1006 len=1048576 repl=1 [143.215.131.246:50010]
5. blk_-6853392225410344145_1006 len=1048576 repl=1 [143.215.131.246:50010]
6. blk_-2293710893046611429_1006 len=1048576 repl=1 [143.215.131.246:50010]
7. blk_-1502992715991891710_1006 len=80238 repl=1 [143.215.131.246:50010]
Upvotes: 0
Views: 2931
Reputation: 3937
If you want distribution at the file level, use a replication factor of at least 2. The first replica of a block is always placed on the node where the writer is located (see the introduction paragraph of http://waset.org/publications/16836/optimizing-hadoop-block-placement-policy-and-cluster-blocks-distribution), and a file normally has only one writer, so the first replicas of all of a file's blocks end up on that same node. You probably don't want to change that behaviour: it leaves you the option of increasing the minimum split size when you want to avoid spawning too many mappers, without losing data locality for the mappers.
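As a sketch of that suggestion (assuming the same file and paths as in the question): -setrep re-replicates a file that is already in HDFS, while a dfs.replication override applies at write time.
# Raise the replication factor of the existing file to 2;
# -w waits until the extra replica has actually been placed.
bin/hadoop fs -setrep -w 2 /user/rkannan3/pg20417.txt
# Or write a fresh copy with replication factor 2 from the start.
bin/hadoop fs -D dfs.replication=2 -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3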
Upvotes: 1
Reputation: 1166
You must use the Hadoop balancer command. Details below.
Balancer
Runs the cluster balancing utility. You can press Ctrl-C at any time to stop the rebalancing process.
Usage: hadoop balancer [-threshold <threshold>]
-threshold <threshold> Percentage of disk capacity. This overwrites the default threshold.
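For example, a run with a 5% threshold (the value here is only an illustration) keeps moving blocks until every datanode's utilisation is within 5% of the cluster average:
# Rebalance until each datanode is within 5% of the average utilisation.
# Ctrl-C stops it at any time; blocks already moved stay where they are.
bin/hadoop balancer -threshold 5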
Upvotes: 0