Reputation: 623
I installed Hadoop 1.0.0 and tried out the word count example on a single-node cluster. It took 2 min 48 s to complete. Then I ran the standard Linux word count program, which finished in about 10 milliseconds on the same data set (180 kB). Am I doing something wrong, or is Hadoop just very, very slow?
time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount someinput someoutput
12/01/29 23:04:41 INFO input.FileInputFormat: Total input paths to process : 30
12/01/29 23:04:41 INFO mapred.JobClient: Running job: job_201201292302_0001
12/01/29 23:04:42 INFO mapred.JobClient: map 0% reduce 0%
12/01/29 23:05:05 INFO mapred.JobClient: map 6% reduce 0%
12/01/29 23:05:15 INFO mapred.JobClient: map 13% reduce 0%
12/01/29 23:05:25 INFO mapred.JobClient: map 16% reduce 0%
12/01/29 23:05:27 INFO mapred.JobClient: map 20% reduce 0%
12/01/29 23:05:28 INFO mapred.JobClient: map 20% reduce 4%
12/01/29 23:05:34 INFO mapred.JobClient: map 20% reduce 5%
12/01/29 23:05:35 INFO mapred.JobClient: map 23% reduce 5%
12/01/29 23:05:36 INFO mapred.JobClient: map 26% reduce 5%
12/01/29 23:05:41 INFO mapred.JobClient: map 26% reduce 8%
12/01/29 23:05:44 INFO mapred.JobClient: map 33% reduce 8%
12/01/29 23:05:53 INFO mapred.JobClient: map 36% reduce 11%
12/01/29 23:05:54 INFO mapred.JobClient: map 40% reduce 11%
12/01/29 23:05:56 INFO mapred.JobClient: map 40% reduce 12%
12/01/29 23:06:01 INFO mapred.JobClient: map 43% reduce 12%
12/01/29 23:06:02 INFO mapred.JobClient: map 46% reduce 12%
12/01/29 23:06:06 INFO mapred.JobClient: map 46% reduce 14%
12/01/29 23:06:09 INFO mapred.JobClient: map 46% reduce 15%
12/01/29 23:06:11 INFO mapred.JobClient: map 50% reduce 15%
12/01/29 23:06:12 INFO mapred.JobClient: map 53% reduce 15%
12/01/29 23:06:20 INFO mapred.JobClient: map 56% reduce 15%
12/01/29 23:06:21 INFO mapred.JobClient: map 60% reduce 17%
12/01/29 23:06:28 INFO mapred.JobClient: map 63% reduce 17%
12/01/29 23:06:29 INFO mapred.JobClient: map 66% reduce 17%
12/01/29 23:06:30 INFO mapred.JobClient: map 66% reduce 20%
12/01/29 23:06:36 INFO mapred.JobClient: map 70% reduce 22%
12/01/29 23:06:37 INFO mapred.JobClient: map 73% reduce 22%
12/01/29 23:06:45 INFO mapred.JobClient: map 80% reduce 24%
12/01/29 23:06:51 INFO mapred.JobClient: map 80% reduce 25%
12/01/29 23:06:54 INFO mapred.JobClient: map 86% reduce 25%
12/01/29 23:06:55 INFO mapred.JobClient: map 86% reduce 26%
12/01/29 23:07:02 INFO mapred.JobClient: map 90% reduce 26%
12/01/29 23:07:03 INFO mapred.JobClient: map 93% reduce 26%
12/01/29 23:07:07 INFO mapred.JobClient: map 93% reduce 30%
12/01/29 23:07:09 INFO mapred.JobClient: map 96% reduce 30%
12/01/29 23:07:10 INFO mapred.JobClient: map 96% reduce 31%
12/01/29 23:07:12 INFO mapred.JobClient: map 100% reduce 31%
12/01/29 23:07:22 INFO mapred.JobClient: map 100% reduce 100%
12/01/29 23:07:28 INFO mapred.JobClient: Job complete: job_201201292302_0001
12/01/29 23:07:28 INFO mapred.JobClient: Counters: 29
12/01/29 23:07:28 INFO mapred.JobClient: Job Counters
12/01/29 23:07:28 INFO mapred.JobClient: Launched reduce tasks=1
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=275346
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/01/29 23:07:28 INFO mapred.JobClient: Launched map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: Data-local map tasks=30
12/01/29 23:07:28 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=137186
12/01/29 23:07:28 INFO mapred.JobClient: File Output Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Written=26287
12/01/29 23:07:28 INFO mapred.JobClient: FileSystemCounters
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_READ=71510
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_READ=89916
12/01/29 23:07:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=956282
12/01/29 23:07:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=26287
12/01/29 23:07:28 INFO mapred.JobClient: File Input Format Counters
12/01/29 23:07:28 INFO mapred.JobClient: Bytes Read=85860
12/01/29 23:07:28 INFO mapred.JobClient: Map-Reduce Framework
12/01/29 23:07:28 INFO mapred.JobClient: Map output materialized bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Map input records=2574
12/01/29 23:07:28 INFO mapred.JobClient: Reduce shuffle bytes=71684
12/01/29 23:07:28 INFO mapred.JobClient: Spilled Records=6696
12/01/29 23:07:28 INFO mapred.JobClient: Map output bytes=118288
12/01/29 23:07:28 INFO mapred.JobClient: CPU time spent (ms)=39330
12/01/29 23:07:28 INFO mapred.JobClient: Total committed heap usage (bytes)=5029167104
12/01/29 23:07:28 INFO mapred.JobClient: Combine input records=8233
12/01/29 23:07:28 INFO mapred.JobClient: SPLIT_RAW_BYTES=4056
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Reduce input groups=1265
12/01/29 23:07:28 INFO mapred.JobClient: Combine output records=3348
12/01/29 23:07:28 INFO mapred.JobClient: Physical memory (bytes) snapshot=4936278016
12/01/29 23:07:28 INFO mapred.JobClient: Reduce output records=1265
12/01/29 23:07:28 INFO mapred.JobClient: Virtual memory (bytes) snapshot=26102546432
12/01/29 23:07:28 INFO mapred.JobClient: Map output records=8233
real 2m48.886s
user 0m3.300s
sys 0m0.304s
time wc someinput/*
178 1001 8674 someinput/capacity-scheduler.xml
178 1001 8674 someinput/capacity-scheduler.xml.bak
7 7 196 someinput/commons-logging.properties
7 7 196 someinput/commons-logging.properties.bak
24 35 535 someinput/configuration.xsl
80 122 1968 someinput/core-site.xml
80 122 1972 someinput/core-site.xml.bak
1 0 1 someinput/dfs.exclude
1 0 1 someinput/dfs.include
12 36 327 someinput/fair-scheduler.xml
45 192 2141 someinput/hadoop-env.sh
45 192 2139 someinput/hadoop-env.sh.bak
20 137 910 someinput/hadoop-metrics2.properties
20 137 910 someinput/hadoop-metrics2.properties.bak
118 582 4653 someinput/hadoop-policy.xml
118 582 4653 someinput/hadoop-policy.xml.bak
241 623 6616 someinput/hdfs-site.xml
241 623 6630 someinput/hdfs-site.xml.bak
171 417 6177 someinput/log4j.properties
171 417 6177 someinput/log4j.properties.bak
1 0 1 someinput/mapred.exclude
1 0 1 someinput/mapred.include
12 15 298 someinput/mapred-queue-acls.xml
12 15 298 someinput/mapred-queue-acls.xml.bak
338 897 9616 someinput/mapred-site.xml
338 897 9630 someinput/mapred-site.xml.bak
1 1 10 someinput/masters
1 1 18 someinput/slaves
57 89 1243 someinput/ssl-client.xml.example
55 85 1195 someinput/ssl-server.xml.example
2574 8233 85860 total
real 0m0.009s
user 0m0.004s
sys 0m0.000s
Upvotes: 5
Views: 9065
Reputation:
Even though Hadoop is not meant for such a small input, we can still tune it to some extent. The total file size is only 180 kB, yet there are 30 blocks, so you must have reduced "dfs.block.size" in hdfs-site.xml. The more input splits there are, the more mappers are launched, which is unnecessary in this case. Hadoop has to be tuned according to the number of nodes and the input data. So increase "dfs.block.size" back up to 64 MB so that this word count runs with a single mapper, which will significantly improve performance.
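For instance, the relevant hdfs-site.xml entry would look roughly like the sketch below (67108864 bytes = 64 MB, the stock 1.x default). Note that the block size is applied when a file is written, so the input files would have to be re-uploaded to HDFS after the change:

<property>
  <!-- 64 MB; only affects files written after the change -->
  <name>dfs.block.size</name>
  <value>67108864</value>
</property>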
Upvotes: 0
Reputation: 1846
Hmm, there is some confusion here, so let me try to clear it up (or add to it).
Suppose you have a problem that can be solved in, say, O(n) time.
What Hadoop does, if you give it, say, K machines, is cut the running time by roughly a factor of K. So you might expect the Hadoop task to have finished faster.
SO WHAT WENT WRONG?
Assume you have a standard Hadoop installation with the standard configuration, and that you are running Hadoop on a single node in the default, local-style setup.
1) You are running the program on a single node, so don't expect a running time any lower than that of a standard program. (The case would be different if you used a multi-node cluster.)
Now the question arises: since a single machine is used, shouldn't the running time at least be the same?
The answer is no. In Hadoop, the data is first read by the record reader, which emits key-value pairs that are passed to the mapper; the mapper processes them and emits key-value pairs of its own (supposing no combiner is used); the data is then sorted and shuffled and handed to the reduce phase; finally the output is written to HDFS. So there is much more overhead here, and that is why you are seeing degraded performance.
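To make that pipeline concrete, here is a minimal sketch of a word count job in the new (org.apache.hadoop.mapreduce) API, roughly the shape of the bundled example; the class names here are illustrative. Every single counted word passes through all of this machinery, which is overhead that a plain wc never pays:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountSketch {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input record (one line), after the record reader
    // has turned raw bytes into key-value pairs: emits (word, 1) pairs.
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    // Called once per key after the sort/shuffle phase: sums the counts.
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count sketch");
    job.setJarByClass(WordCountSketch.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the bundled example also reuses the reducer as a combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // final output is written to HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}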
If you want to see what Hadoop can do, run the same task on a K-node cluster against petabytes of data, and run a single-threaded application alongside it for comparison. I promise you will be startled.
Upvotes: 0
Reputation: 319
Hadoop will generally have an overhead compared to native applications you can run from the terminal. You may get a better time if you tune the number of mappers and reducers, which you should be able to do. If the wordcount example you have doesn't support setting the number of mappers and reducers, try this one:
https://github.com/marek5050/Sparkie/tree/master/Java
using
hadoop jar ./target/wordcount.jar -r 1 -m 4 <input> <output>
The power of Hadoop lies in its ability to spread the work across multiple nodes to process GBs or TBs of data; it is generally not going to be more efficient than anything your computer can already do within a couple of minutes.
Upvotes: 0
Reputation: 25909
As Dave said, Hadoop is optimized to handle large amounts of data, not toy examples. There's a tax for "waking up the elephant" to get things going, and you get nothing back for it when you work on smaller sets. You can take a look at "About the performance of Map Reduce Jobs" for some details on what's going on.
Upvotes: 7
Reputation: 8088
In addition to the other answers, there is one more factor:
You have 30 files to process, so there are 30 tasks to execute. The Hadoop MR overhead per task execution is between 1 and 3 seconds. If you merge the data into a single file, performance will improve significantly, although you will still have the job-startup overhead.
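For example, assuming someinput is the HDFS directory from the question, something along these lines would collapse the 30 small files into one before re-running the job (the merged paths are just placeholders):

# pull the 30 small files out of HDFS as one local file
hadoop fs -getmerge someinput /tmp/merged.txt
# push the single merged file back as a new input directory
hadoop fs -mkdir mergedinput
hadoop fs -put /tmp/merged.txt mergedinput/
# rerun the job against the single-file input
time hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount mergedinput mergedoutput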
I think a local native program will always outperform Hadoop MR. Hadoop is built with scalability and fault tolerance in mind, in many cases sacrificing performance.
Upvotes: 1
Reputation: 2497
To improve Hadoop performance:
Set the number of mappers and reducers.
[Judging by the output of your program, you are already using a number of mappers and reducers. Set them according to your need; using too many mappers or reducers will not boost performance. See the command sketch after this list.]
Use a larger chunk of data (terabytes, or at least gigabytes).
[In Hadoop, the default block size is 64 MB.]
Set up Hadoop on some other machines and run it as a multi-node cluster. That will boost performance.
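As a sketch of the first point: if the bundled wordcount example accepts generic options (the 1.0 version should, via GenericOptionsParser), the reducer count can be passed on the command line; mapred.reduce.tasks is the 1.x property name. The number of map tasks, by contrast, is driven by the input splits and cannot be forced down the same way:

hadoop jar /usr/share/hadoop/hadoop*examples*.jar wordcount \
    -D mapred.reduce.tasks=2 someinput someoutput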
Hadoop is the next big thing.
Upvotes: 0
Reputation: 6169
Your input data was small, and that is why Hadoop took a long time: the job-creation process in Hadoop is heavy, as it involves a lot of setup work. Had the input data been large, you would have seen Hadoop doing much better than wc.
Upvotes: 2
Reputation: 160170
This depends on a large number of factors, including your configuration, your machine, memory config, JVM settings, etc. You also need to subtract JVM startup time.
It runs much more quickly for me. That said, it will of course be slower on small data sets than a dedicated C program; consider what it's doing "behind the scenes".
Try it on a terabyte of data spread across a few thousand files and see what happens.
Upvotes: 11