Not getting performance with MapReduce

Question

I am new to Map-reduce and I am using Hadoop Pipes.

I have an input file which contains the number of records, one per line. I have written one simple program to print those lines in which three words are common. In map function I have emitted the word as a key and record as a value and compared those records in reduce function.

Then I compared Hadoop's performance with simple C++ program in which I read the records from file and split it into words and load the data in map. Map contains word as a key and record as a value. After loading all the data, I compared that data.
But I found that for doing the same task Hadoop Map-reduce takes long time compared with plain C++ program. When I run my program on hadoop it takes about 37 minutes where as it takes only about 5 minutes for simple C++ program.

Please, somebody help me to figure out whether I am doing wrong somewhere? Our application needs performance.

David Gruzman · Accepted Answer

There are several points which should be made here:
Hadoop is not high performance - it is scalable. Local program doing the same on small data set will always outperform hadoop. So its usage makes sense only when you want to run on cluster on machine and enjoy Hadoop's parallel processing.
Hadoop streaming is also not best thing performance wise since there are task switches per line. In many cases native hadoop program written in Java will have better performance

Not getting performance with MapReduce

Answers (1)

Related Questions