user3540466

Reputation: 303

How to estimate MapReduce job time

I have a MapReduce program. When I run it on 1% of the dataset, this is the time it takes:

Job Counters
    Launched map tasks=3
    Launched reduce tasks=45
    Data-local map tasks=1
    Rack-local map tasks=2
    Total time spent by all maps in occupied slots (ms)=29338
    Total time spent by all reduces in occupied slots (ms)=200225
    Total time spent by all map tasks (ms)=29338
    Total time spent by all reduce tasks (ms)=200225
    Total vcore-seconds taken by all map tasks=29338
    Total vcore-seconds taken by all reduce tasks=200225
    Total megabyte-seconds taken by all map tasks=30042112
    Total megabyte-seconds taken by all reduce tasks=205030400

How can I extrapolate to estimate how long analyzing 100% of the data will take? My reasoning was that it would take 100 times longer, since the 1% sample is one block, but when run on 100% of the data it actually takes 134 times longer.

The timing for 100% of the data

Job Counters
    Launched map tasks=2113
    Launched reduce tasks=45
    Data-local map tasks=1996
    Rack-local map tasks=117
    Total time spent by all maps in occupied slots (ms)=26800451
    Total time spent by all reduces in occupied slots (ms)=3607607
    Total time spent by all map tasks (ms)=26800451
    Total time spent by all reduce tasks (ms)=3607607
    Total vcore-seconds taken by all map tasks=26800451
    Total vcore-seconds taken by all reduce tasks=3607607
    Total megabyte-seconds taken by all map tasks=27443661824
    Total megabyte-seconds taken by all reduce tasks=3694189568
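
For reference, here is a rough ratio check between the two runs, computed only from the counters above (a quick Python sketch; the 134x figure refers to the overall job time, which these counters don't show directly):

    # Rough ratio check between the 1% and 100% runs,
    # using the counter values from the two logs above (times in ms).
    small = {"map_tasks": 3, "reduce_tasks": 45,
             "map_ms": 29338, "reduce_ms": 200225}
    full = {"map_tasks": 2113, "reduce_tasks": 45,
            "map_ms": 26800451, "reduce_ms": 3607607}

    for key in ("map_tasks", "reduce_tasks", "map_ms", "reduce_ms"):
        ratio = full[key] / small[key]
        print(f"{key:13s}  1%: {small[key]:>10}  100%: {full[key]:>11}  ratio: {ratio:8.1f}")

    # map tasks grow ~704x, total map time ~913x, total reduce time only ~18x:
    # neither phase scales as a clean 100x, so a simple linear extrapolation
    # of the 1% run was never going to be exact.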

Upvotes: 2

Views: 1634

Answers (2)

Fabian Hueske

Reputation: 18987

As said before, predicting the runtime of a MapReduce job is not trivial. The problem is that the execution time of a job is defined by the finishing time of its last parallel task, and the execution time of a task depends on the hardware it runs on, concurrent workload, data skew, and so on.
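
As a toy illustration of that point (made-up task durations, not taken from this job): the job only finishes when its slowest parallel task does, so a single skewed or slow task stretches the whole runtime even if the average task time barely moves.

    # Toy illustration (hypothetical task durations, in seconds): the job's
    # wall-clock time is driven by the slowest parallel task, not the average.
    balanced = [30, 31, 29, 30, 32]    # evenly distributed work
    skewed   = [30, 31, 29, 30, 120]   # one straggler due to data skew

    for name, tasks in (("balanced", balanced), ("skewed", skewed)):
        avg = sum(tasks) / len(tasks)
        print(f"{name:8s}  avg task: {avg:5.1f}s  job finishes after: {max(tasks)}s")
    # The skewed wave finishes after 120s even though its average only rose to 48s.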

The Starfish project from Duke University might be worth a look. It includes a performance model for Hadoop jobs, can tune job configurations, and offers some nice visualization features that ease debugging.

Upvotes: 0

hvs

Reputation: 36

Predicting MapReduce performance from its performance on a fraction of the data is not easy. If you look at the log for the 1% run, it uses 45 reducers, and the same number of reducers is still used for 100% of the data. This means the time the reducers spend processing the complete output of the shuffle-and-sort phase does not scale linearly.
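
For example, a rough back-of-the-envelope from the counters in the question, dividing total reduce time by the fixed 45 reducers (this ignores overlap between the shuffle and reduce phases):

    # Average per-reducer time, from the counters in the question.
    # The reducer count is fixed at 45 in both runs, so each reducer has to
    # absorb far more shuffle output in the full run.
    REDUCERS = 45
    reduce_ms_1pct = 200225
    reduce_ms_100pct = 3607607

    per_reducer_1pct = reduce_ms_1pct / REDUCERS / 1000      # ~4.4 s each
    per_reducer_100pct = reduce_ms_100pct / REDUCERS / 1000  # ~80 s each

    print(f"avg reducer time, 1% run:   {per_reducer_1pct:.1f} s")
    print(f"avg reducer time, 100% run: {per_reducer_100pct:.1f} s")
    print(f"ratio: {per_reducer_100pct / per_reducer_1pct:.1f}x for ~100x the input")
    # The reduce phase does not scale by the same factor as the input,
    # which is one reason a plain linear extrapolation misses.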

There are mathematical models that have been developed to predict MapReduce performance.

Here is one such study paper that gives much more insight into MapReduce performance:

http://personal.denison.edu/~bressoud/graybressoudmcurcsm2012.pdf

Hope this information is helpful.

Upvotes: 2
