Reputation: 303
I have a MapReduce program, and this is the time it takes when run on 1% of the dataset:
Job Counters
Launched map tasks=3
Launched reduce tasks=45
Data-local map tasks=1
Rack-local map tasks=2
Total time spent by all maps in occupied slots (ms)=29338
Total time spent by all reduces in occupied slots (ms)=200225
Total time spent by all map tasks (ms)=29338
Total time spent by all reduce tasks (ms)=200225
Total vcore-seconds taken by all map tasks=29338
Total vcore-seconds taken by all reduce tasks=200225
Total megabyte-seconds taken by all map tasks=30042112
Total megabyte-seconds taken by all reduce tasks=205030400
How can I extrapolate from this to estimate how long analyzing 100% of the data will take? My reasoning was that it would take 100 times longer, since 1% is one block, but when run on 100% it actually takes 134 times longer.
The timing for 100% of the data:
Job Counters
Launched map tasks=2113
Launched reduce tasks=45
Data-local map tasks=1996
Rack-local map tasks=117
Total time spent by all maps in occupied slots (ms)=26800451
Total time spent by all reduces in occupied slots (ms)=3607607
Total time spent by all map tasks (ms)=26800451
Total time spent by all reduce tasks (ms)=3607607
Total vcore-seconds taken by all map tasks=26800451
Total vcore-seconds taken by all reduce tasks=3607607
Total megabyte-seconds taken by all map tasks=27443661824
Total megabyte-seconds taken by all reduce tasks=3694189568
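A quick sanity check over the counters above (a minimal Python sketch using only the numbers posted) already shows that the two phases scale very differently, and that the total task time grows by roughly 132x rather than 100x:

# Growth factors computed from the two sets of job counters (values in ms)
map_1, reduce_1 = 29338, 200225            # 1% run
map_100, reduce_100 = 26800451, 3607607    # 100% run

print("map phase grew    %.0fx" % (map_100 / map_1))        # ~913x
print("reduce phase grew %.0fx" % (reduce_100 / reduce_1))   # ~18x
print("total task time   %.0fx" % ((map_100 + reduce_100) / (map_1 + reduce_1)))  # ~132x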
Upvotes: 2
Views: 1634
Reputation: 18987
As said before, predicting the runtime of a MapReduce job is not trivial. The problem is that the execution time of a job is determined by the finishing time of its last parallel task, and the execution time of each task depends on the hardware it runs on, the concurrent workload, data skew, and so on.
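To illustrate that point with a toy model (not Starfish's; the slot count below is an assumed value): wall-clock time is roughly the finishing time of the slowest slot, so it depends on how many tasks run concurrently and how skewed their durations are, not only on the total task time.

import random

def toy_wall_clock(num_tasks, slots, avg_task_s, skew=0.3, seed=1):
    """Toy model: tasks are handed greedily to whichever slot frees up first;
    the job ends when the last task finishes."""
    random.seed(seed)
    durations = [avg_task_s * random.uniform(1 - skew, 1 + skew)
                 for _ in range(num_tasks)]
    finish = [0.0] * slots
    for d in sorted(durations, reverse=True):
        i = finish.index(min(finish))   # freest slot gets the next task
        finish[i] += d
    return max(finish)                  # job ends with the last task

# Map phase of the 100% run: 2113 tasks, ~12.7 s each on average
# (26800451 ms / 2113 tasks); 100 concurrent containers is an assumption.
print(toy_wall_clock(2113, slots=100, avg_task_s=12.7))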
The Starfish project from Duke University might be worth a look. It includes a performance model for Hadoop jobs, can tune job configurations, and offers some nice visualisation features that ease debugging.
Upvotes: 0
Reputation: 36
Predicting MapReduce performance from its performance on a fraction of the data is not easy. If you look at the log for the 1% run, it uses 45 reducers, and the same number of reducers is still used for the 100% run. This means the time the reducers spend processing the complete output of the shuffle and sort phase does not scale linearly.
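To make that concrete (a rough sketch; the assumption that shuffle volume grows in proportion to the input is mine): the map side simply gets more tasks of similar size, while each of the 45 reducers has to absorb roughly 100 times more shuffle data, plus a bigger sort/merge.

# From the posted counters: map tasks scale with the input, reducers do not.
maps_small, maps_full = 3, 2113
reducers = 45                       # identical in both runs
input_growth = 100                  # 1% -> 100% of the dataset

print("map tasks: %d -> %d (~%dx more tasks, similar work per task)"
      % (maps_small, maps_full, maps_full // maps_small))
print("reducers stay at %d, so each reducer sees roughly %dx more shuffle input"
      % (reducers, input_growth))   # assumes shuffle volume ~ input size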
Some mathematical models have been developed to predict MapReduce performance.
Here is one such paper, which gives much more insight into the topic:
http://personal.denison.edu/~bressoud/graybressoudmcurcsm2012.pdf
Hope this information is helpful.
Upvotes: 2