user1179295

Reputation: 736

Hadoop / AWS elastic map reduce performance

I am looking for a ballpark figure, if anyone has experience with this...

Does anyone have benchmarks on the speed of AWS's Elastic MapReduce?

Let's say I have 100 million records and I am using Hadoop Streaming (a PHP script) to map, group, and reduce (with some simple PHP calculations). The average group will contain 1-6 records.
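To make that concrete, here is roughly the kind of mapper/reducer pair I mean (the column layout and the per-group sum are simplified stand-ins for the real calculations):

    #!/usr/bin/env php
    <?php
    // mapper.php -- reads one record per line from STDIN and emits
    // "group_key<TAB>value" so Hadoop Streaming can sort and group
    // the output by key before it reaches the reducer.
    // The column layout (id, group_key, amount) is just an example.
    while (($line = fgets(STDIN)) !== false) {
        $fields = explode(",", trim($line));
        if (count($fields) < 3) {
            continue; // skip malformed records
        }
        list(, $groupKey, $amount) = $fields;
        echo $groupKey . "\t" . $amount . "\n";
    }

    #!/usr/bin/env php
    <?php
    // reducer.php -- input arrives sorted by key, so all lines for one
    // group (typically 1-6 records here) are contiguous. Summing the
    // value stands in for whatever the real per-group calculation is.
    $currentKey = null;
    $sum = 0.0;
    while (($line = fgets(STDIN)) !== false) {
        list($key, $value) = explode("\t", trim($line), 2);
        if ($key !== $currentKey) {
            if ($currentKey !== null) {
                echo $currentKey . "\t" . $sum . "\n"; // emit finished group
            }
            $currentKey = $key;
            $sum = 0.0;
        }
        $sum += (float) $value;
    }
    if ($currentKey !== null) {
        echo $currentKey . "\t" . $sum . "\n";
    }

The two scripts get passed to the streaming jar with its -mapper and -reducer options; Hadoop sorts the mapper output by key so each group reaches the reducer contiguously.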

Also, is it better/more cost-effective to run a bunch of small instances or fewer larger ones? I realize the work is broken up into nodes within an instance, but regardless, will larger nodes have higher I/O, making them faster per node per server (and more cost-efficient)?

Also, with streaming, how is the ratio of mappers to reducers determined?

Upvotes: 0

Views: 500

Answers (1)

Sean Owen

Reputation: 66866

I don't know that you can give a meaningful benchmark -- it's kind of like asking how fast a computer program generally runs. It's not possible to say how fast your program will run without knowing something about the script.

If you mean how fast the instances that power an EMR job are: they're the same spec as the underlying EC2 instances that you specify from AWS.

If you want a very rough take on how EMR performs: I'd say you will probably run into an I/O bottleneck before a CPU bottleneck.

In theory this means you should run many small instances and ask for rack diversity, in order to maybe grab more I/O resources across more machines rather than letting them compete. In practice I've found that fewer, higher-I/O instances can be more effective. But even this impression doesn't always hold -- it really depends on how busy the zone is and where your jobs are scheduled.
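If you want to settle it empirically, the simplest approach is to time the same job on two cluster shapes. As a rough sketch (the instance types and counts below are placeholders, not recommendations), the two configurations map onto the EMR RunJobFlow "Instances" structure like this:

    <?php
    // Option A: many small nodes -- spread the I/O across more machines.
    $manySmallNodes = [
        'MasterInstanceType' => 'm1.small',
        'SlaveInstanceType'  => 'm1.small',
        'InstanceCount'      => 20,
    ];

    // Option B: fewer, larger nodes -- more I/O per node, less contention.
    $fewLargeNodes = [
        'MasterInstanceType' => 'm1.xlarge',
        'SlaveInstanceType'  => 'm1.xlarge',
        'InstanceCount'      => 5,
    ];

    // Either array would go in the 'Instances' field of a RunJobFlow
    // request (e.g. via the AWS SDK for PHP); running the same job on
    // both shapes tells you which is faster and cheaper for your data.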

Upvotes: 1
