ableHercules

Reputation: 660

Hadoop replication factor of 1 on a four-node cluster

I have Hadoop set up on four nodes. One node hosts the NameNode and Secondary NameNode; the other three are datanodes. I ran a Sqoop job with a replication factor of 3. The job was successful and the data ended up on all three datanodes; it took around 1.5 hours to complete with 6 mappers. I ran the same job with a replication factor of 1. This job was also successful and ran for about 1 hour with 12 mappers.
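For context, a replication factor can be overridden for a single Sqoop job (rather than cluster-wide in hdfs-site.xml) via Hadoop's generic -D option. A minimal sketch; the connection string, username, table, and target directory are hypothetical placeholders:

    # Sketch: override HDFS replication for this one import only.
    # All connection details below are placeholders, not the real job.
    sqoop import \
        -D dfs.replication=1 \
        --connect jdbc:mysql://dbhost/mydb \
        --username myuser \
        --table mytable \
        --num-mappers 12 \
        --target-dir /data/mytable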
My questions are:

1. When I ran the job the second time with a replication factor of 1, where was the data stored? Was the data split and stored across all three datanodes, or was it stored on the machine from which I ran the job?

2. Each datanode has a 6-core processor and 64 GB of RAM. Which properties should I set to obtain optimum values for the Sqoop job?

These are the logs for the first job:

    15/06/30 00:21:28 INFO mapreduce.Job: Counters: 30
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=749046
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=864
            HDFS: Number of bytes written=253986997858
            HDFS: Number of read operations=24
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=12
        Job Counters
            Launched map tasks=6
            Other local map tasks=6
            Total time spent by all maps in occupied slots (ms)=20582400
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=20582400
            Total vcore-seconds taken by all map tasks=20582400
            Total megabyte-seconds taken by all map tasks=73767321600
        Map-Reduce Framework
            Map input records=162991238
            Map output records=162991238
            Input split bytes=864
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=187671
            CPU time spent (ms)=21216950
            Physical memory (bytes) snapshot=5210345472
            Virtual memory (bytes) snapshot=57549950976
            Total committed heap usage (bytes)=6410469376
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=253986997858
    15/06/30 00:21:28 INFO mapreduce.ImportJobBase: Transferred 236.5438 GB in 5,524.6156 seconds (43.8439 MB/sec)
    15/06/30 00:21:28 INFO mapreduce.ImportJobBase: Retrieved 162991238 records.

These are the logs for the second job:

    15/06/30 10:21:02 INFO mapreduce.Job: Counters: 30
        File System Counters
            FILE: Number of bytes read=0
            FILE: Number of bytes written=1498130
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=1744
            HDFS: Number of bytes written=253986997858
            HDFS: Number of read operations=48
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=24
        Job Counters
            Launched map tasks=12
            Other local map tasks=12
            Total time spent by all maps in occupied slots (ms)=22551454
            Total time spent by all reduces in occupied slots (ms)=0
            Total time spent by all map tasks (ms)=22551454
            Total vcore-seconds taken by all map tasks=22551454
            Total megabyte-seconds taken by all map tasks=80824411136
        Map-Reduce Framework
            Map input records=162991238
            Map output records=162991238
            Input split bytes=1744
            Spilled Records=0
            Failed Shuffles=0
            Merged Map outputs=0
            GC time elapsed (ms)=186898
            CPU time spent (ms)=21910100
            Physical memory (bytes) snapshot=9802846208
            Virtual memory (bytes) snapshot=115099107328
            Total committed heap usage (bytes)=12298747904
        File Input Format Counters
            Bytes Read=0
        File Output Format Counters
            Bytes Written=253986997858
    15/06/30 10:21:02 INFO mapreduce.ImportJobBase: Transferred 236.5438 GB in 3,647.7444 seconds (66.4029 MB/sec)
    15/06/30 10:21:02 INFO mapreduce.ImportJobBase: Retrieved 162991238 records.

Upvotes: 1

Views: 385

Answers (1)

Sagar Bhalodiya

Reputation: 422

Here are my answers to both of your questions.

1. When you run with a replication factor of 1, HDFS keeps only one copy of each data block, but the data is still distributed across all three datanodes. Blocks are placed automatically across the cluster, which is why the data does not end up on a single machine (you can verify this with the hdfs fsck sketch after this list).

2. Set the number of mappers for your job according to the cores/slots available in your cluster; that is what will be optimal. Here you have 6-core machines, and I am assuming 4 cores are allotted to mappers and 2 to reducers. So 4 cores × 3 nodes × 2 (two mappers can run per core) = 24 mappers would be optimal for this job (by default, Sqoop uses only 4). A sketch of passing this on the Sqoop command line follows below.
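To check where the blocks of the single-replica import actually landed, you can ask the NameNode directly. A minimal sketch, assuming a hypothetical import path /data/mytable:

    # Report every file and block under the path, plus the datanode(s)
    # holding each block. With replication factor 1 you should see a
    # single location per block, spread across the three datanodes.
    hdfs fsck /data/mytable -files -blocks -locations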
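As for the properties: the mapper count is passed on the Sqoop command line, while per-node parallelism is governed by the YARN settings yarn.nodemanager.resource.cpu-vcores and yarn.nodemanager.resource.memory-mb. A minimal sketch with hypothetical connection details and a hypothetical split column id:

    # Sketch: request 24 parallel mappers and tell Sqoop how to
    # partition the source table among them (--split-by is needed
    # when the table has no single-column primary key).
    sqoop import \
        --connect jdbc:mysql://dbhost/mydb \
        --username myuser \
        --table mytable \
        --split-by id \
        --num-mappers 24 \
        --target-dir /data/mytable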

Hope this clarifies your doubt.

Upvotes: 1
