mcserep

Reputation: 3299

What does "time spent by map task" include on Hadoop?

After a Hadoop job succeeds, a summary of various counters is shown; see the example below. My question is: what exactly is included in the Total time spent by all map tasks counter? Specifically, when a map task is not node-local, is the time spent copying the data included or not?

17/01/25 09:06:12 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=2941
                FILE: Number of bytes written=241959
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=3251
                HDFS: Number of bytes written=2051
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=23168
                Total time spent by all reduces in occupied slots (ms)=4957
                Total time spent by all map tasks (ms)=5792
                Total time spent by all reduce tasks (ms)=4957
                Total vcore-milliseconds taken by all map tasks=5792
                Total vcore-milliseconds taken by all reduce tasks=4957
                Total megabyte-milliseconds taken by all map tasks=23724032
                Total megabyte-milliseconds taken by all reduce tasks=5075968
        Map-Reduce Framework
                Map input records=9
                Map output records=462
                Map output bytes=4986
                Map output materialized bytes=2941
                Input split bytes=109
                Combine input records=462
                Combine output records=221
                Reduce input groups=221
                Reduce shuffle bytes=2941
                Reduce input records=221
                Reduce output records=221
                Spilled Records=442
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=84
                CPU time spent (ms)=2090
                Physical memory (bytes) snapshot=471179264
                Virtual memory (bytes) snapshot=4508950528
                Total committed heap usage (bytes)=326631424
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=3142
        File Output Format Counters
                Bytes Written=2051

Upvotes: 1

Views: 772

Answers (1)

Serhiy

Reputation: 4141

I believe that the data copy time is included in the Total time spent by all map tasks counter.

First of all, if you check the server-side code (mostly related to resource management), you can see that the MILLIS_MAPS counter (which corresponds to the metric you are referring to) is updated inside the TaskAttemptImpl class and records the duration of the task attempt. The task attempt's launchTime is set when the container is started and is about to begin execution (and from my reading of the source code, it does not appear that either of the components moves any data around at this point; only split metadata is being passed around).
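To make that concrete, here is a minimal sketch (the class and method names are illustrative, not the actual Hadoop internals) of how a counter derived from the attempt's lifetime ends up covering everything between container launch and task completion, including any remote reads performed during execution:

```java
// Illustrative sketch only: the counter is essentially finishTime - launchTime
// for the attempt, so whatever happens in between is counted.
public class MapTaskMillisSketch {

    // launchTime: set when the container starts (no input data read yet).
    // finishTime: set when the attempt completes.
    static long taskMillis(long launchTime, long finishTime) {
        return finishTime - launchTime;
    }

    public static void main(String[] args) {
        long launch = 1_000L;  // container launched (ms)
        long finish = 6_792L;  // attempt finished (ms)
        // Reading the split from HDFS, possibly over the network if the task
        // is not node-local, happens inside this window and is thus included.
        System.out.println(taskMillis(launch, finish)); // prints 5792
    }
}
```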

Now, when the container is started, the InputFormat opens an InputStream, which is responsible for fetching the data the Mapper needs to start processing (at this point there are different filesystems the stream can be attached to, but have a look at DistributedFileSystem). You can check the steps performed in the MapTask.runNewMapper(...) method, where you have:

// opens the record reader over the split; this may pull data from a
// remote DataNode if the task is not node-local
input.initialize(split, mapperContext);
// runs the actual Mapper over the records; both steps happen inside
// the timed task attempt
mapper.run(mapperContext);

(I am on Hadoop 2.6)
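As a side note on the counters in the question: the "occupied slots" variant appears to be the plain task time scaled by the container size relative to the scheduler's minimum allocation. This is a sketch under that assumption (using mapreduce.map.memory.mb=4096 and yarn.scheduler.minimum-allocation-mb=1024, which the posted numbers match exactly):

```java
// Sketch of the assumed relation between "Total time spent by all map
// tasks (ms)" and "Total time spent by all maps in occupied slots (ms)":
// slot-milliseconds = task-milliseconds * (container MB / min allocation MB).
public class SlotMillisSketch {

    static long slotMillis(long taskMillis, long containerMb, long minAllocMb) {
        return taskMillis * (containerMb / minAllocMb);
    }

    public static void main(String[] args) {
        // Map:    5792 ms * (4096 MB / 1024 MB) = 23168 ms, as posted.
        System.out.println(slotMillis(5792L, 4096L, 1024L));
        // Reduce: 4957 ms * (1024 MB / 1024 MB) = 4957 ms, as posted.
        System.out.println(slotMillis(4957L, 1024L, 1024L));
        // The megabyte-milliseconds counter is consistent with this too:
        // 5792 ms * 4096 MB = 23724032, matching the posted value.
    }
}
```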

Upvotes: 2
