Reputation: 660
I have Hadoop set up on four nodes. One node runs the NameNode and the Secondary NameNode; the other three are DataNodes. I ran a Sqoop job with a replication factor of 3. The job succeeded and the data was on all three DataNodes. It took around 1.5 hours to complete with 6 mappers. I then ran the same job with a replication factor of 1. This job also succeeded and ran for about 1 hour with 12 mappers.
My questions are:
1. When I ran the job the second time, with a replication factor of 1, where was the data stored? Is it split and stored across all three DataNodes, or is it stored only on the machine from which I ran the job?
2. Each DataNode has a 6-core processor and 64 GB of RAM. Which properties should I set to get the best performance from the Sqoop job?
These are the logs for the first job:
15/06/30 00:21:28 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=749046
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=864
HDFS: Number of bytes written=253986997858
HDFS: Number of read operations=24
HDFS: Number of large read operations=0
HDFS: Number of write operations=12
Job Counters
Launched map tasks=6
Other local map tasks=6
Total time spent by all maps in occupied slots (ms)=20582400
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=20582400
Total vcore-seconds taken by all map tasks=20582400
Total megabyte-seconds taken by all map tasks=73767321600
Map-Reduce Framework
Map input records=162991238
Map output records=162991238
Input split bytes=864
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=187671
CPU time spent (ms)=21216950
Physical memory (bytes) snapshot=5210345472
Virtual memory (bytes) snapshot=57549950976
Total committed heap usage (bytes)=6410469376
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=253986997858
15/06/30 00:21:28 INFO mapreduce.ImportJobBase: Transferred 236.5438 GB in 5,524.6156 seconds (43.8439 MB/sec)
15/06/30 00:21:28 INFO mapreduce.ImportJobBase: Retrieved 162991238 records.
These are the logs for the second job:
15/06/30 10:21:02 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=1498130
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1744
HDFS: Number of bytes written=253986997858
HDFS: Number of read operations=48
HDFS: Number of large read operations=0
HDFS: Number of write operations=24
Job Counters
Launched map tasks=12
Other local map tasks=12
Total time spent by all maps in occupied slots (ms)=22551454
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=22551454
Total vcore-seconds taken by all map tasks=22551454
Total megabyte-seconds taken by all map tasks=80824411136
Map-Reduce Framework
Map input records=162991238
Map output records=162991238
Input split bytes=1744
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=186898
CPU time spent (ms)=21910100
Physical memory (bytes) snapshot=9802846208
Virtual memory (bytes) snapshot=115099107328
Total committed heap usage (bytes)=12298747904
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=253986997858
15/06/30 10:21:02 INFO mapreduce.ImportJobBase: Transferred 236.5438 GB in 3,647.7444 seconds (66.4029 MB/sec)
15/06/30 10:21:02 INFO mapreduce.ImportJobBase: Retrieved 162991238 records.
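As a rough sanity check on the two logs above, the figures can be compared with a little arithmetic. This is only a sketch: the helper names are made up for illustration, and the numbers are copied from the log lines (the "GB" in the Sqoop summary line appears to be 2**30 bytes, since 253986997858 bytes works out to 236.5438 of them).

```python
# Figures copied from the two job logs above.
GIB_TRANSFERRED = 236.5438  # "Transferred 236.5438 GB" (units of 2**30 bytes)

RUNS = {
    "replication=3, 6 mappers": {"seconds": 5524.6156, "maps": 6, "map_ms": 20582400},
    "replication=1, 12 mappers": {"seconds": 3647.7444, "maps": 12, "map_ms": 22551454},
}


def throughput_mb_per_s(gib, seconds):
    """Overall transfer rate in MB/s (1 MB = 2**20 bytes), as Sqoop reports it."""
    return gib * 1024 / seconds


def avg_map_seconds(map_ms, maps):
    """Average wall-clock time of one map task, in seconds."""
    return map_ms / maps / 1000


def speedup(slow_seconds, fast_seconds):
    """How many times faster the second run finished than the first."""
    return slow_seconds / fast_seconds


if __name__ == "__main__":
    for name, run in RUNS.items():
        rate = throughput_mb_per_s(GIB_TRANSFERRED, run["seconds"])
        per_map = avg_map_seconds(run["map_ms"], run["maps"])
        print(f"{name}: {rate:.2f} MB/s overall, {per_map:.1f} s per map task")
    first, second = (run["seconds"] for run in RUNS.values())
    print(f"second run finished {speedup(first, second):.2f}x faster")
```

Note that the two runs differ in both the replication factor and the number of mappers, so the difference in run time cannot be attributed to either change alone from these logs.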
Upvotes: 1
Views: 385
Reputation: 422
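Here is my answer to your first question. When you run with a replication factor of 1, HDFS keeps only one copy of each data block, but the blocks are still distributed across all three DataNodes. With the default placement policy the single replica is written on the DataNode where the writing map task runs (assuming the map tasks run on the DataNodes), so the data ends up spread over the nodes that ran the mappers, not on the machine from which you submitted the job.
I hope this clarifies your doubt.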
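To put numbers on the difference, here is a minimal sketch of the storage arithmetic. The block size is an assumption (128 MB, a common `dfs.blocksize` default in Hadoop 2; the question does not state it), and the function names are hypothetical. The byte count is the "HDFS: Number of bytes written" counter, which is identical in both logs and so appears to count logical bytes rather than replicated bytes.

```python
BYTES_WRITTEN = 253_986_997_858   # "HDFS: Number of bytes written" in both logs
BLOCK_SIZE = 128 * 1024 * 1024    # ASSUMPTION: dfs.blocksize is 128 MB


def min_block_count(total_bytes, block_size):
    """Lower bound on the number of HDFS blocks (integer ceiling division).

    The real count is slightly higher because each mapper writes its own
    file, and every file ends with its own partial block.
    """
    return -(-total_bytes // block_size)


def raw_storage_bytes(total_bytes, replication):
    """Disk space used across the cluster once every block is replicated."""
    return total_bytes * replication


if __name__ == "__main__":
    blocks = min_block_count(BYTES_WRITTEN, BLOCK_SIZE)
    print(f"at least {blocks} blocks")
    for replication in (1, 3):
        gib = raw_storage_bytes(BYTES_WRITTEN, replication) / 2**30
        print(f"replication={replication}: about {gib:.1f} GiB of raw disk")
```

With replication 1 each of those blocks lives on exactly one DataNode; with replication 3 on a three-DataNode cluster every DataNode holds a copy of every block. To see where the blocks actually landed, `hdfs fsck <path> -files -blocks -locations` lists the DataNodes holding each block of the files under `<path>`.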
Upvotes: 1