AKC

Reputation: 1033

Number of Mappers: Mapreduce vs Sqoop

The number of mappers can't be set directly in a MapReduce program, because it is determined by the input splits (that is, by input size). So why does Sqoop have a num-mappers option? If a MapReduce program picks the number of mappers on its own and doesn't let us choose it, why is Sqoop allowed to?

Upvotes: 0

Views: 128

Answers (1)

leftjoin

Reputation: 38335

Sqoop splits your dataset using the --split-by column. Read how it works here. Also, run Sqoop in verbose mode (--verbose) to better understand what it does. It gets the min and max values of the split column and divides the whole range into num-mappers parts, assuming the split column is evenly distributed. If it is not evenly distributed, Sqoop will divide the rows between mappers unevenly (with skew).
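To make the range splitting concrete, here is a minimal Python sketch of the idea for an integer split column. It is not Sqoop's actual code; the function names `split_ranges` and `rows_per_mapper` are hypothetical, and it only illustrates how equal-width ranges can end up holding very different row counts when the column is skewed.

```python
def split_ranges(lo, hi, num_mappers):
    """Divide the inclusive range [lo, hi] into num_mappers contiguous,
    roughly equal-width ranges (illustrative only, not Sqoop's code)."""
    span = hi - lo + 1
    # Integer arithmetic keeps the boundaries exact.
    bounds = [lo + (i * span) // num_mappers for i in range(num_mappers)]
    bounds.append(hi + 1)
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_mappers)]


def rows_per_mapper(values, num_mappers):
    """Count how many split-column values fall into each mapper's range."""
    ranges = split_ranges(min(values), max(values), num_mappers)
    return [sum(1 for v in values if start <= v <= end) for start, end in ranges]


# Evenly distributed column: every mapper gets the same share.
even = list(range(1, 101))
print(split_ranges(1, 100, 4))
print(rows_per_mapper(even, 4))

# Skewed column: one outlier stretches the range, so one mapper does most of the work.
skewed = [1, 2, 3, 4, 5, 6, 7, 8, 100]
print(rows_per_mapper(skewed, 4))
```

In the skewed case the first range receives eight of the nine rows, which is the skew described above.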

The number of mappers is also configurable, at least in Hive. For example, if you are using Tez, you can configure the min and max grouped split size:

set tez.grouping.min-size=16777216; -- 16 MB min split
set tez.grouping.max-size=1073741824; -- 1 GB max split

You can also configure the number of splits, and where possible Tez will start a number of mappers close to it (some splits can be combined and some cannot be split, but the setting will still affect the number of mappers):

set tez.grouping.split-count=5000;

This approach is not recommended; it is better to use the split size settings above.

For MR execution engine:

set mapreduce.input.fileinputformat.split.minsize=16777216; -- 16 MB
set mapreduce.input.fileinputformat.split.maxsize=1073741824; -- 1 GB
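As a rough illustration of how these two settings influence the mapper count, the sketch below applies the split-size rule used by the standard FileInputFormat, max(minsize, min(maxsize, blocksize)), to a single splittable file. The function names and the 128 MB block size are assumptions for the example, and Hive's combining input formats may group splits differently, so treat the result as an estimate only.

```python
import math

MB = 1024 * 1024


def split_size(block_size, min_size, max_size):
    """Split size rule of the standard FileInputFormat."""
    return max(min_size, min(max_size, block_size))


def estimate_mappers(file_size, block_size, min_size, max_size):
    """Approximate mapper count for one splittable file (hypothetical helper)."""
    return math.ceil(file_size / split_size(block_size, min_size, max_size))


# Assumed example: a 10 GB file on a cluster with 128 MB blocks.
file_size = 10 * 1024 * MB
block_size = 128 * MB

# With the settings above (16 MB min, 1 GB max) the block size wins.
print(estimate_mappers(file_size, block_size, 16 * MB, 1024 * MB))

# Raising the minimum above the block size yields fewer, larger splits.
print(estimate_mappers(file_size, block_size, 256 * MB, 1024 * MB))
```

Under this rule, lowering the maximum below the block size gives more mappers, and raising the minimum above it gives fewer.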

Controlling the number of mappers is not so easy because it depends on many factors. For example, ORC is split at the stripe level, which means you cannot get a split smaller than a single stripe. Read more about the number of mappers

Upvotes: 0
