Bheem Singh

Reputation: 21

Number of mappers in sqoop

I know Sqoop has an option to set the number of mappers (the default is 4). In real-world projects, who decides the number of mappers, and how? Do we use the default or pick an arbitrary number? I have seen some theoretical links that say the number of mappers depends on your hardware and other considerations, but they don't give me a practical way of deciding. Any help on how it's actually done in production would be greatly appreciated.

Upvotes: 0

Views: 2022

Answers (1)

Jagrut Sharma

Reputation: 4754

The --num-mappers value is a hint, and Sqoop may not use exactly the number specified. By default, the value is 4.

This parameter controls the parallelism. For example, if you are importing data from a database into a Hive table, the number of mappers is the number of concurrent connections Sqoop will open to the database to pull data in parallel. On one hand, more mappers means more parallelism and a faster transfer. On the other hand, it puts more load on the database.
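To make the parallelism concrete, here is a rough Python model of how an import can be divided among mappers. As far as I understand it, for an integer split column Sqoop queries the minimum and maximum values and hands each mapper one contiguous range. The function name `split_ranges` is my own, and this is a simplified sketch, not Sqoop's actual code; the real splitting depends on the column type and the connector.

```python
def split_ranges(min_val, max_val, num_mappers):
    """Divide the integer interval [min_val, max_val] into contiguous
    half-open ranges [lo, hi), one per mapper (illustrative model only)."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_val < min_val:
        raise ValueError("max_val must not be smaller than min_val")

    upper = max_val + 1          # exclusive bound, so max_val itself is covered
    span = upper - min_val       # number of distinct integer values
    parts = min(num_mappers, span)  # no point in more ranges than values

    # Evenly spaced boundaries; integer division keeps them on whole values.
    bounds = [min_val + (span * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# Each tuple corresponds to one mapper's "WHERE id >= lo AND id < hi" slice.
print(split_ranges(0, 1000, 4))
# → [(0, 250), (250, 500), (500, 750), (750, 1001)]
```

Each range means one more open connection and one more query running against the database at the same time, which is why the load grows with the mapper count.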

Increasing the number of mappers beyond a certain point will probably saturate the database (or hit a connection limit set by the DBA), so performance will plateau.

Also, your cluster should have enough free resources to support the number of mappers you specify.

You can do a few sample runs with different values and see which gives the best performance for your dataset and environment.
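One way to script those sample runs is a small Python harness like the one below. It is only a sketch: the helper names `build_import_command` and `time_mapper_counts` are mine, and the connection string, table, split column, and target directory are placeholders you would replace with your own. The Sqoop flags used (--connect, --table, --split-by, --target-dir, --num-mappers) are standard import options; credentials such as --username are left out for brevity.

```python
import subprocess
import time


def build_import_command(num_mappers, connect, table, split_by, target_dir):
    """Build a `sqoop import` argument list for one trial run.

    Each trial writes to its own subdirectory, because an import
    normally fails if the target directory already exists.
    """
    return [
        "sqoop", "import",
        "--connect", connect,
        "--table", table,
        "--split-by", split_by,
        "--target-dir", f"{target_dir}/mappers_{num_mappers}",
        "--num-mappers", str(num_mappers),
    ]


def time_mapper_counts(candidates, build, runner=subprocess.run):
    """Run one import per candidate mapper count and return
    {mapper_count: elapsed_seconds}.

    `build` maps a mapper count to a command list; `runner` is
    injectable so the harness can be dry-run without Sqoop installed.
    """
    timings = {}
    for n in candidates:
        cmd = build(n)
        start = time.monotonic()
        runner(cmd, check=True)  # raises if the import exits non-zero
        timings[n] = time.monotonic() - start
    return timings
```

You would call `time_mapper_counts([2, 4, 8, 16], ...)` with a `build` function that fills in your own connection details, then compare the timings. Run the trials at a time when the database load is typical, since a quiet database will make high mapper counts look better than they would in production.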

Upvotes: 1
