What's the factor that decides how many MapReduce jobs are used for a query?

Question

I have 2.5 million rows of data and 6 columns. When I execute a query in Hive, I sometimes get 1 job and sometimes 2 jobs, which seems completely random to me. How does Hive decide how many MapReduce jobs to run for a query?

I appreciate your answer!

UPDATE

Queries:

SELECT r.title, r.rank FROM ratings r JOIN genres g ON r.title=g.title WHERE g.genre='Action' ORDER BY r.rank DESC LIMIT 1;

-> 2 Jobs

select distinct(genre) from genres

-> 1 Job

dimamah · Accepted Answer

Each job typically has a map part and a reduce part.
The query engine decides how many jobs will be generated and what happens in the map and reduce parts of each job.
It always applies some degree of optimization to execute as few jobs as possible.

A (very) simplified example of the execution of your 1st query:
1st job: Mappers read both tables r and g, applying the filter g.genre='Action'. The reducers then receive the data from the mappers (distributed by the join key title) and perform the join.
2nd job: The intermediate output of the 1st job is the joined data of the two tables. Because you asked for it to be ordered, mappers read this intermediate output, and a single (!) reducer receives all the data from the mappers and sorts it.

To check how many stages (jobs) a query generates, you can use the EXPLAIN command.
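As a minimal sketch, this is how the EXPLAIN check could look for the first query from the question. The table and column names are taken from the question; nothing here is specific to a particular Hive version.

```sql
-- Prefix the query with EXPLAIN to print its plan instead of running it.
-- The plan lists the stages and their dependencies; the stages marked as
-- map-reduce stages correspond to the jobs Hive will launch.
EXPLAIN
SELECT r.title, r.rank
FROM ratings r
JOIN genres g ON r.title = g.title
WHERE g.genre = 'Action'
ORDER BY r.rank DESC
LIMIT 1;
```

The exact plan depends on the Hive version and its settings. For example, if Hive converts the join into a map-side join, the number of stages may differ from the simplified two-job description above.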
