Reputation: 623
I installed Hadoop 1.0.0 and tried out the word count example on a single-node cluster. It took 2 min 48 s to complete. Then I ran the standard Linux word count program, which finished in about 10 milliseconds on the same data set (180 kB). Am I doing something wrong, or is Hadoop just very, very slow?
time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput someoutput
12/01/29 23:04:41 INFO input.FileInputFormat: Total input paths to process : 30
12/01/29 23:04:41 INFO mapred.JobClient: Running job: job_201201292302_0001
12/01/29 23:04:42 INFO mapred.JobClient: map 0% reduce 0%
12/01/29 23:05:05 INFO mapred.JobClient: map 6% reduce 0%
12/01/29 23:05:15 INFO mapred.JobClient: map 13% reduce 0%
12/01/29 23:05:25 INFO mapred.JobClient: map 16% reduce 0%
12/01/29 23:05:27 INFO mapred.JobClient: map 20% reduce 0%
12/01/29 23:05:28 INFO mapred.JobClient: map 20% reduce 4%
12/01/29 23:05:34 INFO mapred.JobClient: map 20% reduce 5%
12/01/29 23:05:35 INFO mapred.JobClient: map 23% reduce 5%
12/01/29 23:05:36 INFO mapred.JobClient: map 26% reduce 5%
12/01/29 23:05:41 INFO mapred.JobClient: map 26% reduce 8%
12/01/29 23:05:44 INFO mapred.JobClient: map 33% reduce 8%
12/01/29 23:05:53 INFO mapred.JobClient: map 36% reduce 11%
12/01/29 23:05:54 INFO mapred.JobClient: map 40% reduce 11%
12/01/29 23:05:56 INFO mapred.JobClient: map 40% reduce 12%
12/01/29 23:06:01 INFO mapred.JobClient: map 43% reduce 12%
12/01/29 23:06:02 INFO mapred.JobClient: map 46% reduce 12%
12/01/29 23:06:06 INFO mapred.JobClient: map 46% reduce 14%
12/01/29 23:06:09 INFO mapred.JobClient: map 46% reduce 15%
12/01/29 23:06:11 INFO mapred.JobClient: map 50% reduce 15%
12/01/29 23:06:12 INFO mapred.JobClient: map 53% reduce 15%
12/01/29 23:06:20 INFO mapred.JobClient: map 56% reduce 15%
12/01/29 23:06:21 INFO mapred.JobClient: map 60% reduce 17%
12/01/29 23:06:28 INFO mapred.JobClient: map 63% reduce 17%
12/01/29 23:06:29 INFO mapred.JobClient: map 66% reduce 17%
12/01/29 23:06:30 INFO mapred.JobClient: map 66% reduce 20%
12/01/29 23:06:36 INFO mapred.JobClient: map 70% reduce 22%
12/01/29 23:06:37 INFO mapred.JobClient: map 73% reduce 22%
12/01/29 23:06:45 INFO mapred.JobClient: map 80% reduce 24%
12/01/29 23:06:51 INFO mapred.JobClient: map 80% reduce 25%
12/01/29 23:06:54 INFO mapred.JobClient: map 86% reduce 25%
12/01/29 23:06:55 INFO mapred.JobClient: map 86% reduce 26%
12/01/29 23:07:02 INFO mapred.JobClient: map 90% reduce 26%
12/01/29 23:07:03 INFO mapred.JobClient: map 93% reduce 26%
12/01/29 23:07:07 INFO mapred.JobClient: map 93% reduce 30%
12/01/29 23:07:09 INFO mapred.JobClient: map 96% reduce 30%
12/01/29 23:07:10 INFO mapred.JobClient: map 96% reduce 31%
12/01/29 23:07:12 INFO mapred.JobClient: map 100% reduce 31%
12/01/29 23:07:22 INFO mapred.JobClient: map 100% reduce 100%
12/01/29 23:07:28 INFO mapred.JobClient: Job complete: job_201201292302_0001
12/01/29 23:07:28 INFO mapred.JobClient: Counters: 29
12/01/29 23:07:28 INFO mapred.JobClient: Job Counters
12/01/29 23:07:28 INFO mapred.JobClient: Launched reduce tasks=1
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=275346
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Launched map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: Data-local map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=137186
12/01/29 23:07:28 INFO mapred.JobClient: File Output Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Written=26287
12/01/29 23:07:28 INFO mapred.JobClient: FileSystemCounters
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_READ=71510
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_READ=89916
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=956282
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=26287
12/01/29 23:07:28 INFO mapred.JobClient: File Input Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Read=85860
12/01/29 23:07:28 INFO mapred.JobClient: Map-Reduce Framework
12/01/29 23:07:28 INFO mapred.JobClient: Map output materialized bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Map input records=2574
12/01/29 23:07:28 INFO mapred.JobClient: Reduce shuffle bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Spilled Records=6696
12/01/29 23:07:28 INFO mapred.JobClient: Map output bytes=118288
12/01/29 23:07:28 INFO mapred.JobClient: CPU time spent (ms)=39330
12/01/29 23:07:28 INFO mapred.JobClient: Total committed heap usage (bytes)=5029167104
12/01/29 23:07:28 INFO mapred.JobClient: Combine input records=8233
12/01/29 23:07:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=4056
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input groups=1265
12/01/29 23:07:28 INFO mapred.JobClient: Combine output records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=4936278016
12/01/29 23:07:28 INFO mapred.JobClient: Reduce output records=1265
12/01/29 23:07:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=26102546432
12/01/29 23:07:28 INFO mapred.JobClient: Map output records=8233
real 2m48.886s
user 0m3.300s
sys 0m0.304s
time wc someinput/*
178 1001 8674 someinput/capacity-scheduler.xml
178 1001 8674 someinput/capacity-scheduler.xml.bak
7 7 196 someinput/commons-logging.properties
7 7 196 someinput/commons-logging.properties.bak
24 35 535 someinput/configuration.xsl
80 122 1968 someinput/core-site.xml
80 122 1972 someinput/core-site.xml.bak
1 0 1 someinput/dfs.exclude
1 0 1 someinput/dfs.include
12 36 327 someinput/fair-scheduler.xml
45 192 2141 someinput/hadoop-env.sh
45 192 2139 someinput/hadoop-env.sh.bak
20 137 910 someinput/hadoop-metrics2.properties
20 137 910 someinput/hadoop-metrics2.properties.bak
118 582 4653 someinput/hadoop-policy.xml
118 582 4653 someinput/hadoop-policy.xml.bak
241 623 6616 someinput/hdfs-site.xml
241 623 6630 someinput/hdfs-site.xml.bak
171 417 6177 someinput/log4j.properties
171 417 6177 someinput/log4j.properties.bak
1 0 1 someinput/mapred.exclude
1 0 1 someinput/mapred.include
12 15 298 someinput/mapred-queue-acls.xml
12 15 298 someinput/mapred-queue-acls.xml.bak
338 897 9616 someinput/mapred-site.xml
338 897 9630 someinput/mapred-site.xml.bak
1 1 10 someinput/masters
1 1 18 someinput/slaves
57 89 1243 someinput/ssl-client.xml.example
55 85 1195 someinput/ssl-server.xml.example
2574 8233 85860 total
real 0m0.009s
user 0m0.004s
sys 0m0.000s
Upvotes: 5
Views: 9065
Reputation:
Even though Hadoop is not meant for such a small input, we can still tune it to some extent. The total file size is only 180 kB, yet there are 30 blocks, so you must have reduced "dfs.block.size" in hdfs-site.xml. The more input splits there are, the more mappers are launched, which is unnecessary in this case. Hadoop has to be tuned according to the number of nodes and the input data. So increase "dfs.block.size" back up to 64 MB so that this word count runs with a single mapper, which will significantly improve performance.
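For instance, the relevant hdfs-site.xml entry would look roughly like the sketch below (67108864 bytes = 64 MB, the stock 1.x default). Note that the block size is applied when a file is written, so the input files would have to be re-uploaded to HDFS after the change:

<property>
  <!-- 64 MB; only affects files written after the change -->
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>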
Upvotes: 0
Reputation: 1846
Hmm, there is some confusion here, so let me try to clear it up (or add to it).
Suppose you have a problem that can be solved in, say, O(n) time.
What Hadoop does, if you give it, say, K machines, is cut the running time by roughly a factor of K. So you might expect the Hadoop task to have finished faster.
SO WHAT WENT WRONG?
Assume you have a standard Hadoop installation with the standard configuration, and that you are running Hadoop on a single node in the default, local-style setup.
1) You are running the program on a single node, so don't expect a running time any lower than that of a standard program. (The case would be different if you used a multi-node cluster.)
Now the question arises: since a single machine is used, shouldn't the running time at least be the same?
The answer is no. In Hadoop, the data is first read by the record reader, which emits key-value pairs that are passed to the mapper; the mapper processes them and emits key-value pairs of its own (supposing no combiner is used); the data is then sorted and shuffled and handed to the reduce phase; finally the output is written to HDFS. So there is much more overhead here, and that is why you are seeing degraded performance.
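To make that pipeline concrete, here is a minimal sketch of a word count job in the new (org.apache.hadoop.mapreduce) API, roughly the shape of the bundled example; the class names here are illustrative. Every single counted word passes through all of this machinery, which is overhead that a plain wc never pays:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input record (one line), after the record reader
    // has turned raw bytes into key-value pairs: emits (word, 1) pairs.
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    // Called once per key after the sort/shuffle phase: sums the counts.
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the bundled example also reuses the reducer as a combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output is written to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}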
If you want to see what Hadoop can do, run the same task on a K-node cluster against petabytes of data, and run a single-threaded application alongside it for comparison. I promise you will be startled.
Upvotes: 0
Reputation: 319
Hadoop will generally have an overhead compared to native applications you can run from the terminal. You may get a better time if you tune the number of mappers and reducers, which you should be able to do. If the wordcount example you have doesn't support setting the number of mappers and reducers, try this one:
https://github.com/marek5050/Sparkie/tree/master/Java
using
hadoop jar ./target/wordcount.jar -r 1 -m 4 <input> <output>
The power of Hadoop lies in its ability to spread the work across multiple nodes to process GBs or TBs of data; it is generally not going to be more efficient than anything your computer can already do within a couple of minutes.
Upvotes: 0
Reputation: 25909
As Dave said, Hadoop is optimized to handle large amounts of data, not toy examples. There's a tax for "waking up the elephant" to get things going, and you get nothing back for it when you work on smaller sets. You can take a look at "About the performance of Map Reduce Jobs" for some details on what's going on.
Upvotes: 7
Reputation: 8088
In addition to the other answers, there is one more factor:
You have 30 files to process, so there are 30 tasks to execute. The Hadoop MR overhead per task execution is between 1 and 3 seconds. If you merge the data into a single file, performance will improve significantly, although you will still have the job-startup overhead.
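For example, assuming someinput is the HDFS directory from the question, something along these lines would collapse the 30 small files into one before re-running the job (the merged paths are just placeholders):

# pull the 30 small files out of HDFS as one local file
hadoop fs -getmerge someinput /tmp/merged.txt
# push the single merged file back as a new input directory
hadoop fs -mkdir mergedinput
hadoop fs -put /tmp/merged.txt mergedinput/
# rerun the job against the single-file input
time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount mergedinput mergedoutput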
I think a local native program will always outperform Hadoop MR. Hadoop is built with scalability and fault tolerance in mind, in many cases sacrificing performance.
Upvotes: 1
Reputation: 2497
To improve Hadoop performance:
Set the number of mappers and reducers.
[Judging by the output of your program, you are already using a number of mappers and reducers. Set them according to your need; using too many mappers or reducers will not boost performance. See the command sketch after this list.]
Use a larger chunk of data (terabytes, or at least gigabytes).
[In Hadoop, the default block size is 64 MB.]
Set up Hadoop on some other machines and run it as a multi-node cluster. That will boost performance.
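As a sketch of the first point: if the bundled wordcount example accepts generic options (the 1.0 version should, via GenericOptionsParser), the reducer count can be passed on the command line; mapred.reduce.tasks is the 1.x property name. The number of map tasks, by contrast, is driven by the input splits and cannot be forced down the same way:

hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount \
    -D mapred.reduce.tasks=2 someinput someoutput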
Hadoop is the next big thing.
Upvotes: 0
Reputation: 6169
Your input data was small, and that is why Hadoop took a long time: the job-creation process in Hadoop is heavy, as it involves a lot of setup work. Had the input data been large, you would have seen Hadoop doing much better than wc.
Upvotes: 2
Reputation: 160170
This depends on a large number of factors, including your configuration, your machine, memory config, JVM settings, etc. You also need to subtract JVM startup time.
It runs much more quickly for me. That said, it will of course be slower on small data sets than a dedicated C program; consider what it's doing "behind the scenes".
Try it on a terabyte of data spread across a few thousand files and see what happens.
Upvotes: 11