Vaibhav Wadikar
Vaibhav Wadikar

Reputation: 61

how to force hive to distribute rows equally in insert overwrite into a partitioned table from another table among the reducers to improve performance

I want to insert into a partitioned hive table from another hive table. The data is going in a single partition in the target table. The problem is all the reducers completing very fast but one of the reducers is taking a long time as all work is going to that single reducer.

I want to find a way to set a work equally distributed among all the reducers. Is there any way to do so? How can I improve the performance of the insert overwrite?

Source Table DDL :

 CREATE EXTERNAL TABLE employee ( id INT,first_name String,latst_name String,email String,gender String) STORED AS TEXTFILE '/emp/data'

TARGET TABLE DDL :

 CREATE EXTERNAL TABLE employee_stage ( id INT,first_name String,latst_name String,email String,gender String) PARTITIONED BY (batch_id bigint) STORED AS ORC LOCATION '/stage/emp/data'

Here is the data snapshot

1   Helen   Perrie  [email protected]   Female
2   Rafaelita   Jancso  [email protected] Female
3   Letti   Kelley  [email protected] Female
4   Adela   Dmisek  [email protected]  Female
5   Lay Reyner  [email protected]  Male
6   Robby   Felder  [email protected]  Male
7   Thayne  Brunton [email protected]   Male
8   Lorrie  Roony   [email protected]  Male
9   Hodge   Straun  [email protected] Male
10  Gawain  Tomblett    [email protected]   Male
11  Carey   Facher  [email protected] Male
12  Pamelina    Elijahu [email protected] Female
13  Carmelle    Dabs    [email protected]  Female
14  Moore   Baldrick    [email protected]    Male
15  Sheff   Morin   [email protected]  Male
16  Zed Eary    [email protected]  Male
17  Angus   Pollastrone [email protected]    Male
18  Moises  Hubach  [email protected] Male
19  Lilllie Beetham [email protected] Female
20  Mortimer    De Hooge    [email protected]  Male

The source table contains more than 100M of records.

Here is the hql I am using.

insert overwrite table employee_stage
PARTITION (batch_id)
SELECT
  id,
  first_name,
  latst_name,
  email,
  gender,
  123456789 as batch_id
FROM employee;

The data is going in a single partition.

Please let me know in this condition how can I improve the performance? Is there any way to distribute the rows equally among all the reducers?

Upvotes: 2

Views: 2009

Answers (1)

leftjoin
leftjoin

Reputation: 38335

I suppose you are not doing JOINS or some other heavy transformations in your insert overwrite query and skew is really happened during insert. Because if you do then question should be not about insert.

Try to add distribute by batch_id to your insert query and re-run. If still running with skew then check your data. There are too many data for some particular batch_id or maybe you have a lot of nulls. There are different approaches of how to deal with skewed data. One of them is to filter out skewed keys and load them separately. Check long running reducer logs on job tracker, it will give you more information about where is a problem.

Upvotes: 2

Related Questions