JRR

Reputation: 6152

Hadoop performance comparison

When is Hadoop supposed to perform faster than a sequential program?

I ran word count on a single-node HDFS, and the sequential version that opens the file from HDFS and iterates through each word is actually faster than the Hadoop implementation from the tutorial; it seems most of the time was spent spawning mappers.
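
Roughly, this is what my sequential version looks like (a sketch; the HDFS URI and the input path are just placeholders for my single-node setup):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sequential word count: open the file straight from HDFS and tally words in memory.
    public class SequentialWordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://localhost:9000");  // placeholder URI for my single node
            FileSystem fs = FileSystem.get(conf);

            Map<String, Long> counts = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(new Path("/input/sample.txt"))))) {  // placeholder path
                String line;
                while ((line = reader.readLine()) != null) {
                    for (String word : line.split("\\s+")) {
                        if (!word.isEmpty()) {
                            counts.merge(word, 1L, Long::sum);
                        }
                    }
                }
            }
            counts.forEach((w, c) -> System.out.println(w + "\t" + c));
        }
    }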

Is this supposed to happen? Did I somehow have the wrong setup? Or is Hadoop simply not supposed to be faster than a sequential program on a single-node instance? I am confused.

Upvotes: 0

Views: 495

Answers (3)

YoungHobbit

Reputation: 13402

What was the size of the data on which you ran this performance comparison? I am guessing it was small.

Hadoop is designed for processing large datasets, where the size of the data is in the hundreds of GB or TB. There is a lot of start-up overhead associated with Hadoop, which is not the case for the sequential program you executed.

Check this: Don't use Hadoop - your data isn't that big.

Another reference: MapReduce Job Overhead

Upvotes: 3

RojoSam

Reputation: 1496

WordCount is a very simple but not very efficient example. Use it to validate that your cluster is working, but NEVER for performance tests.

Let me explain why.

WordCount parses each line of text and, for each word found, writes the record (WORD, 1) to the mapper output. As you can see, the full output of the mappers will be bigger than the input, and that larger mapper output becomes the input of the reducers. So the job ends up reading more than twice the amount of the input data and writing to disk the original input plus the counters.
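
For reference, this is roughly what the tutorial mapper does (a sketch along the lines of the standard WordCount example, shown here as a standalone class):

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Mapper from the standard WordCount example: it writes one (word, 1) record
    // for every token, so the map output is larger than the input text.
    public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);  // one output record per word in the input
            }
        }
    }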

In addition to that, you need to transfer the mapper output to the reducers. And if you are using only one reducer, that last step will be similar to your sequential job.

The job could be optimized, for example by using combiners and multiple reducers, as in the sketch below.
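
For instance, the tutorial driver could be adjusted roughly like this (a sketch; TokenizerMapper and IntSumReducer are the classes from the standard WordCount example, and the reducer count is just an illustrative value):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver sketch: the same WordCount job, but with a combiner and several reducers.
    // The reducer class doubles as the combiner because summing counts is associative.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(TokenizerMapper.class);    // mapper from the tutorial
            job.setCombinerClass(IntSumReducer.class);    // pre-aggregates (word, 1) records on the map side
            job.setReducerClass(IntSumReducer.class);
            job.setNumReduceTasks(4);                     // illustrative value, tune for the cluster
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }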

Hadoop will be faster than a local sequential job when the amount of data is bigger than the local resources (RAM, disk, CPU) can handle, and/or when the cost of initializing the containers and transferring data among them is offset by the number of nodes working in parallel.

Upvotes: 0

Legato

Reputation: 1081

There are many parameters to this equation. How many servers/datanodes are used? How many CPU cores and how much memory are available on each? Is the data you're reading splittable? (e.g., some compressed formats such as gzip aren't splittable and will be read by a single mapper), and so on.

There isn't enough of this information in your question, so these are the principles you should be aware of when setting your performance expectations.

Upvotes: 0
