Reputation: 303
I have a MapReduce program, and this is the time it takes when run on 1% of the dataset:
Job Counters
Launched map tasks=3
Launched reduce tasks=45
Data-local map tasks=1
Rack-local map tasks=2
Total time spent by all maps in occupied slots (ms)=29338
Total time spent by all reduces in occupied slots (ms)=200225
Total time spent by all map tasks (ms)=29338
Total time spent by all reduce tasks (ms)=200225
Total vcore-seconds taken by all map tasks=29338
Total vcore-seconds taken by all reduce tasks=200225
Total megabyte-seconds taken by all map tasks=30042112
Total megabyte-seconds taken by all reduce tasks=205030400
How can I extrapolate from this to estimate how long analyzing 100% of the data will take? My reasoning was that it would take 100 times longer, since 1% is one block, but when run on 100% it actually takes 134 times longer.
The timing for 100% of the data:
Job Counters
Launched map tasks=2113
Launched reduce tasks=45
Data-local map tasks=1996
Rack-local map tasks=117
Total time spent by all maps in occupied slots (ms)=26800451
Total time spent by all reduces in occupied slots (ms)=3607607
Total time spent by all map tasks (ms)=26800451
Total time spent by all reduce tasks (ms)=3607607
Total vcore-seconds taken by all map tasks=26800451
Total vcore-seconds taken by all reduce tasks=3607607
Total megabyte-seconds taken by all map tasks=27443661824
Total megabyte-seconds taken by all reduce tasks=3694189568
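A quick sanity check over the counters above (a minimal Python sketch using only the numbers posted) already shows that the two phases scale very differently, and that the total task time grows by roughly 132x rather than 100x:

# Growth factors computed from the two sets of job counters (values in ms)
map_1, reduce_1 = 29338, 200225            # 1% run
map_100, reduce_100 = 26800451, 3607607    # 100% run

print("map phase grew    %.0fx" % (map_100 / map_1))        # ~913x
print("reduce phase grew %.0fx" % (reduce_100 / reduce_1))   # ~18x
print("total task time   %.0fx" % ((map_100 + reduce_100) / (map_1 + reduce_1)))  # ~132x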
Upvotes: 2
Views: 1634
Reputation: 18987
As said before, predicting the runtime of a MapReduce job is not trivial. The problem is that the execution time of a job is determined by the finishing time of its last parallel task, and the execution time of each task depends on the hardware it runs on, the concurrent workload, data skew, and so on.
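To illustrate that point with a toy model (not Starfish's; the slot count below is an assumed value): wall-clock time is roughly the finishing time of the slowest slot, so it depends on how many tasks run concurrently and how skewed their durations are, not only on the total task time.

import random

def toy_wall_clock(num_tasks, slots, avg_task_s, skew=0.3, seed=1):
    """Toy model: tasks are handed greedily to whichever slot frees up first;
    the job ends when the last task finishes."""
    random.seed(seed)
    durations = [avg_task_s * random.uniform(1 - skew, 1 + skew)
                 for _ in range(num_tasks)]
    finish = [0.0] * slots
    for d in sorted(durations, reverse=True):
        i = finish.index(min(finish))   # freest slot gets the next task
        finish[i] += d
    return max(finish)                  # job ends with the last task

# Map phase of the 100% run: 2113 tasks, ~12.7 s each on average
# (26800451 ms / 2113 tasks); 100 concurrent containers is an assumption.
print(toy_wall_clock(2113, slots=100, avg_task_s=12.7))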
The Starfish project from Duke University might be worth a look. It includes a performance model for Hadoop jobs, can tune job configurations, and offers some nice visualisation features that ease debugging.
Upvotes: 0
Reputation: 36
Predicting MapReduce performance from its performance on a fraction of the data is not easy. If you look at the log for the 1% run, it uses 45 reducers, and the same number of reducers is still used for the 100% run. This means the time the reducers spend processing the complete output of the shuffle and sort phase does not scale linearly.
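To make that concrete (a rough sketch; the assumption that shuffle volume grows in proportion to the input is mine): the map side simply gets more tasks of similar size, while each of the 45 reducers has to absorb roughly 100 times more shuffle data, plus a bigger sort/merge.

# From the posted counters: map tasks scale with the input, reducers do not.
maps_small, maps_full = 3, 2113
reducers = 45                       # identical in both runs
input_growth = 100                  # 1% -> 100% of the dataset

print("map tasks: %d -> %d (~%dx more tasks, similar work per task)"
      % (maps_small, maps_full, maps_full // maps_small))
print("reducers stay at %d, so each reducer sees roughly %dx more shuffle input"
      % (reducers, input_growth))   # assumes shuffle volume ~ input size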
Some mathematical models have been developed to predict MapReduce performance.
Here is one such paper, which gives much more insight into the topic:
http://personal.denison.edu/~bressoud/graybressoudmcurcsm2012.pdf
Hope this information is helpful.
Upvotes: 2