Georgi

Reputation: 402

Hadoop map-only job

My situation is as follows:

I have two MapReduce jobs. The first is a full MapReduce job that produces output sorted by key.

The second is a map-only job that extracts part of the data and simply collects it.

The second job has no reducer.

The problem is that I am not sure whether the output of the map-only job will stay sorted or whether it will be shuffled after the map function.

Upvotes: 0

Views: 1276

Answers (2)

Niels Basjes

Reputation: 10642

First of all: if your second job only contains a filter to include or exclude specific records, then you are better off simply adding this filter to the end of the reducer of the first job.

A rather important fact about MapReduce is that the reducers will sort the records in "some way" that you do not control. When writing a job, you should assume the records are output in a random order.

If you really need all records to be output in a specific order, then using the SecondarySort mechanism in combination with a single reducer is the "easy" solution, but it doesn't scale well. The "hard" solution is what the "Tera sort" benchmark uses. Read this SO question for more insight into how that works: How does the MapReduce sort algorithm work?
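To make the "hard" solution a bit more concrete, here is a minimal plain-Java sketch of the range-partitioning idea behind a total-order sort. It is not Hadoop code and not the actual Tera sort implementation; the class `RangePartitionSketch` and its methods are hypothetical names made up for illustration, and the "sample" is simply the whole key list. The point is only that if every key in partition i is smaller than every key in partition i+1, each partition can be sorted independently and the concatenation is globally sorted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration class; nothing here is part of the Hadoop API.
class RangePartitionSketch {

    // Pick (numPartitions - 1) split points from a sorted copy of the sample.
    static List<String> splitPoints(List<String> sample, int numPartitions) {
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        List<String> points = new ArrayList<>();
        if (sorted.isEmpty()) {
            return points; // no sample: everything falls into partition 0
        }
        for (int i = 1; i < numPartitions; i++) {
            points.add(sorted.get(i * sorted.size() / numPartitions));
        }
        return points;
    }

    // Partition index = number of split points that are <= key.
    static int partitionFor(String key, List<String> points) {
        int p = 0;
        while (p < points.size() && key.compareTo(points.get(p)) >= 0) {
            p++;
        }
        return p;
    }

    // Range-partition the keys, sort each partition on its own (the role a
    // reducer would play), then concatenate the partitions in index order.
    static List<String> totalOrderSort(List<String> keys, int numPartitions) {
        // Assumption for brevity: the sample is the full key list.
        List<String> points = splitPoints(keys, numPartitions);
        List<List<String>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (String key : keys) {
            partitions.get(partitionFor(key, points)).add(key);
        }
        List<String> out = new ArrayList<>();
        for (List<String> part : partitions) {
            Collections.sort(part);
            out.addAll(part);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys =
            Arrays.asList("pear", "apple", "fig", "kiwi", "banana", "mango");
        System.out.println(totalOrderSort(keys, 3));
    }
}
```

With a hash-based partitioner the same per-partition sort would still happen, but neighbouring keys could land in different partitions, so the concatenated output would not be globally ordered.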

Upvotes: 1

twid

Reputation: 6686

No. As zsxwing said, no further processing is done unless you specify a reducer. If you do specify one, partitioning is performed on the map side, and sorting and grouping are done on the reduce side.
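As a rough illustration of the point above, here is a plain-Java model of a map-only pass. It deliberately does not use the Hadoop API; `MapOnlySketch`, `runMapTask` and `runMapOnlyJob` are hypothetical names invented for this sketch. In a real job the reducer is switched off with `job.setNumReduceTasks(0)`, and each map task then writes its own output file in the order in which it emits records.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical model of a map-only job; not Hadoop code.
class MapOnlySketch {

    // One map task: records are written in the order the map function
    // emits them. There is no partition, sort, or shuffle step.
    static List<String> runMapTask(List<String> split, Predicate<String> keep) {
        List<String> output = new ArrayList<>();
        for (String record : split) {
            if (keep.test(record)) {
                output.add(record);
            }
        }
        return output;
    }

    // One output list per input split, mirroring one output file per map task.
    static List<List<String>> runMapOnlyJob(List<List<String>> splits,
                                            Predicate<String> keep) {
        List<List<String>> outputs = new ArrayList<>();
        for (List<String> split : splits) {
            outputs.add(runMapTask(split, keep));
        }
        return outputs;
    }

    public static void main(String[] args) {
        // Two already-sorted splits, standing in for the first job's output.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("a1", "b2", "c3"),
            Arrays.asList("d4", "e5", "f6"));
        System.out.println(runMapOnlyJob(splits, r -> !r.startsWith("b")));
    }
}
```

Under this model a filter over a sorted split stays sorted within that split's output. What it does not give you is a single globally ordered file: the output is spread over one file per map task, and a large input file may be divided into several splits.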

Upvotes: 0
