Reputation: 3299
After a Hadoop job succeeds, a summary of various counters is shown (see the example below). My question is: what exactly is included in the Total time spent by all map tasks
counter? Specifically, if a map task is not node-local, is the time spent copying the data included or not?
17/01/25 09:06:12 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=2941
FILE: Number of bytes written=241959
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3251
HDFS: Number of bytes written=2051
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=23168
Total time spent by all reduces in occupied slots (ms)=4957
Total time spent by all map tasks (ms)=5792
Total time spent by all reduce tasks (ms)=4957
Total vcore-milliseconds taken by all map tasks=5792
Total vcore-milliseconds taken by all reduce tasks=4957
Total megabyte-milliseconds taken by all map tasks=23724032
Total megabyte-milliseconds taken by all reduce tasks=5075968
Map-Reduce Framework
Map input records=9
Map output records=462
Map output bytes=4986
Map output materialized bytes=2941
Input split bytes=109
Combine input records=462
Combine output records=221
Reduce input groups=221
Reduce shuffle bytes=2941
Reduce input records=221
Reduce output records=221
Spilled Records=442
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=84
CPU time spent (ms)=2090
Physical memory (bytes) snapshot=471179264
Virtual memory (bytes) snapshot=4508950528
Total committed heap usage (bytes)=326631424
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=3142
File Output Format Counters
Bytes Written=2051
Upvotes: 1
Views: 772
Reputation: 4141
I believe that the data copy time is included in the Total time spent by all map tasks
metric.
First of all, if you check the server-side code (mostly the resource-management side), you can see that the MILLIS_MAPS
counter (which corresponds to the metric you are referring to) is updated inside the TaskAttemptImpl
class with the duration of the task attempt. The task attempt's launchTime is set when the container is started and is about to begin execution (and from my reading of the source code, none of the components moves any data around at this point; only the split metadata is passed around).
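As a rough sketch of that logic (simplified and not the verbatim Hadoop source; the method and variable names approximate what TaskAttemptImpl does in the 2.x line):

// Simplified sketch of how MILLIS_MAPS is accumulated when a map
// attempt finishes (names approximate, modeled on TaskAttemptImpl):
long duration = taskAttempt.getFinishTime() - taskAttempt.getLaunchTime();
// launchTime is taken when the container starts, before any input
// data is read, so everything the attempt does afterwards is counted
jobCounterUpdate.addCounterUpdate(JobCounter.MILLIS_MAPS, duration);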
Now, when the container is started, the InputFormat
opens an InputStream
, which is responsible for fetching the data the Mapper needs to start processing (at this point there are different filesystems the stream can be attached to, but take a look at DistributedFileSystem
). You can check the steps performed in the MapTask.runNewMapper(...)
method, where you have:
input.initialize(split, mapperContext); // opens the stream over the split's data
mapper.run(mapperContext);              // feeds the records to the Mapper
(I am on Hadoop 2.6)
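To illustrate where the remote read actually starts, here is a minimal sketch of what input.initialize(...) boils down to for a plain text split (modeled on LineRecordReader; simplified, not the verbatim source; conf and genericSplit stand for the task's Configuration and the InputSplit passed in):

// Simplified sketch of RecordReader.initialize(...) for a text split
// (modeled on LineRecordReader; not the verbatim Hadoop source):
FileSplit split = (FileSplit) genericSplit;
Path file = split.getPath();
FileSystem fs = file.getFileSystem(conf); // e.g. DistributedFileSystem for HDFS
FSDataInputStream in = fs.open(file);     // the (possibly remote) read starts here,
in.seek(split.getStart());                // well inside the timed task attempt

Since this happens inside the already-running task attempt, reading a non-local block from another DataNode is counted toward Total time spent by all map tasks.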
Upvotes: 2