aqjune

Reputation: 552

From Hadoop logs, how can I find intermediate output byte sizes & reduce output byte sizes?

From the Hadoop logs, how can I estimate the total size of the mappers' intermediate output (in bytes) and the total size of the reducers' output (in bytes)?

My mappers and reducers use LZO compression, and I want to know the size of mapper/reducer outputs after compression.

15/06/06 17:19:15 INFO mapred.JobClient:  map 100% reduce 94%
15/06/06 17:19:16 INFO mapred.JobClient:  map 100% reduce 98%
15/06/06 17:19:17 INFO mapred.JobClient:  map 100% reduce 99%
15/06/06 17:20:04 INFO mapred.JobClient:  map 100% reduce 100%
15/06/06 17:20:05 INFO mapred.JobClient: Job complete: job_201506061602_0026
15/06/06 17:20:05 INFO mapred.JobClient: Counters: 30
15/06/06 17:20:05 INFO mapred.JobClient:   Job Counters 
15/06/06 17:20:05 INFO mapred.JobClient:     Launched reduce tasks=401
15/06/06 17:20:05 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=1203745
15/06/06 17:20:05 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
15/06/06 17:20:05 INFO mapred.JobClient:     Rack-local map tasks=50
15/06/06 17:20:05 INFO mapred.JobClient:     Launched map tasks=400
15/06/06 17:20:05 INFO mapred.JobClient:     Data-local map tasks=350
15/06/06 17:20:05 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=6642599
15/06/06 17:20:05 INFO mapred.JobClient:   File Output Format Counters 
15/06/06 17:20:05 INFO mapred.JobClient:     Bytes Written=534808008
15/06/06 17:20:05 INFO mapred.JobClient:   FileSystemCounters
15/06/06 17:20:05 INFO mapred.JobClient:     FILE_BYTES_READ=247949371
15/06/06 17:20:05 INFO mapred.JobClient:     HDFS_BYTES_READ=168030609
15/06/06 17:20:05 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=651797418
15/06/06 17:20:05 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=534808008
15/06/06 17:20:05 INFO mapred.JobClient:   File Input Format Counters 
15/06/06 17:20:05 INFO mapred.JobClient:     Bytes Read=167978609
15/06/06 17:20:05 INFO mapred.JobClient:   Map-Reduce Framework
15/06/06 17:20:05 INFO mapred.JobClient:     Map output materialized bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient:     Map input records=3774768
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce shuffle bytes=354979707
15/06/06 17:20:05 INFO mapred.JobClient:     Spilled Records=56007636
15/06/06 17:20:05 INFO mapred.JobClient:     Map output bytes=336045816
15/06/06 17:20:05 INFO mapred.JobClient:     Total committed heap usage (bytes)=592599187456
15/06/06 17:20:05 INFO mapred.JobClient:     CPU time spent (ms)=9204120
15/06/06 17:20:05 INFO mapred.JobClient:     Combine input records=0
15/06/06 17:20:05 INFO mapred.JobClient:     SPLIT_RAW_BYTES=52000
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce input records=28003818
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce input groups=11478107
15/06/06 17:20:05 INFO mapred.JobClient:     Combine output records=0
15/06/06 17:20:05 INFO mapred.JobClient:     Physical memory (bytes) snapshot=516784615424
15/06/06 17:20:05 INFO mapred.JobClient:     Reduce output records=94351104
15/06/06 17:20:05 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=1911619866624
15/06/06 17:20:05 INFO mapred.JobClient:     Map output records=28003818

Upvotes: 0

Views: 518

Answers (1)

Maddy RS

Reputation: 1031

You can get this information from the FileSystemCounters. Details of the counters in this group are given below:

FILE_BYTES_READ is the number of bytes read from the local file system. Assuming all the map input data comes from HDFS, FILE_BYTES_READ should be zero during the map phase. On the other hand, the reducers' input is data on the reduce-side local disks, fetched from the map-side disks. Therefore, FILE_BYTES_READ denotes the total bytes read by the reducers.

FILE_BYTES_WRITTEN consists of two parts. The first part comes from the mappers: all the mappers spill intermediate output to disk, and every byte they write to disk is included in FILE_BYTES_WRITTEN. The second part comes from the reducers: in the shuffle phase, the reducers fetch intermediate data from the mappers, then merge and spill it to the reduce-side disks. The bytes the reducers write to disk are also included in FILE_BYTES_WRITTEN.

HDFS_BYTES_READ denotes the bytes the mappers read from HDFS when the job starts. This includes not only the content of the source files but also metadata about the splits.

HDFS_BYTES_WRITTEN denotes the bytes written to HDFS. It’s the number of bytes of the final output.
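Since these counters are printed in a regular `Name=Value` format by JobClient, you can pull them out of the log mechanically. Below is a minimal Python sketch (not part of the answer; it just parses the log lines quoted in the question) that collects the counters into a dict and prints the two values the answer identifies:

```python
import re

# Excerpt of the JobClient log from the question, for illustration.
log = """\
15/06/06 17:20:05 INFO mapred.JobClient:     FILE_BYTES_READ=247949371
15/06/06 17:20:05 INFO mapred.JobClient:     HDFS_BYTES_READ=168030609
15/06/06 17:20:05 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=651797418
15/06/06 17:20:05 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=534808008
"""

counters = {}
for line in log.splitlines():
    # Counter lines end with "Name=integer"; capture both pieces.
    m = re.search(r"mapred\.JobClient:\s+(.+?)=(\d+)$", line)
    if m:
        counters[m.group(1)] = int(m.group(2))

# Per the answer: total bytes read by the reducers from local disk.
print("Reducer local reads:", counters["FILE_BYTES_READ"])
# Per the answer: the final (post-compression) output written to HDFS.
print("Final output bytes:", counters["HDFS_BYTES_WRITTEN"])
```

Feeding the full log through the same loop would also capture the Map-Reduce Framework counters (e.g. "Map output materialized bytes"), since they use the same `Name=Value` layout.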

Upvotes: 2

Related Questions