How to force Hive to distribute rows equally among the reducers in an insert overwrite into a partitioned table from another table, to improve performance

Question

I want to insert into a partitioned Hive table from another Hive table. All the data goes into a single partition of the target table. The problem is that all the reducers finish very quickly except one, which takes a long time because all the work goes to that single reducer.

I want to distribute the work equally among all the reducers. Is there any way to do so? How can I improve the performance of the insert overwrite?

Source table DDL:

 CREATE EXTERNAL TABLE employee ( id INT,first_name String,latst_name String,email String,gender String) STORED AS TEXTFILE LOCATION '/emp/data'

Target table DDL:

 CREATE EXTERNAL TABLE employee_stage ( id INT,first_name String,latst_name String,email String,gender String) PARTITIONED BY (batch_id bigint) STORED AS ORC LOCATION '/stage/emp/data'

Here is the data snapshot

1   Helen   Perrie  hperrie0@lulu.com   Female
2   Rafaelita   Jancso  rjancso1@cdbaby.com Female
3   Letti   Kelley  lkelley2@slideshare.net Female
4   Adela   Dmisek  admisek3@state.gov  Female
5   Lay Reyner  lreyner4@wired.com  Male
6   Robby   Felder  rfelder5@microsoft.com  Male
7   Thayne  Brunton tbrunton6@sun.com   Male
8   Lorrie  Roony   lroony7@oracle.com  Male
9   Hodge   Straun  hstraun8@w3.org Male
10  Gawain  Tomblett    gtomblett9@toplist.cz   Male
11  Carey   Facher  cfachera@ca.gov Male
12  Pamelina    Elijahu pelijahub@goo.ne.jp Female
13  Carmelle    Dabs    cdabsc@bizjournals.com  Female
14  Moore   Baldrick    mbaldrickd@yandex.ru    Male
15  Sheff   Morin   smorine@purevolume.com  Male
16  Zed Eary    zearyf@livejournal.com  Male
17  Angus   Pollastrone apollastroneg@wikispaces.com    Male
18  Moises  Hubach  mhubachh@usnews.com Male
19  Lilllie Beetham lbeethami@diigo.com Female
20  Mortimer    De Hooge    mdehoogej@ucoz.com  Male

The source table contains more than 100M records.

Here is the HQL I am using:

insert overwrite table employee_stage
PARTITION (batch_id)
SELECT
  id,
  first_name,
  latst_name,
  email,
  gender,
  123456789 as batch_id
FROM employee;

All the data goes into a single partition.

How can I improve the performance in this situation? Is there any way to distribute the rows equally among all the reducers?

leftjoin · Accepted Answer

I assume you are not doing joins or other heavy transformations in your insert overwrite query, and that the skew really happens during the insert. If you are, then the question is not really about the insert.

Try adding distribute by batch_id to your insert query and re-run it. If it still runs with skew, check your data: there may be too much data for some particular batch_id, or you may have a lot of nulls. There are different approaches to dealing with skewed data. One of them is to filter out the skewed keys and load them separately. Check the logs of the long-running reducer on the job tracker; they will give you more information about where the problem is.
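
As a rough sketch of the two suggestions above: the first statement is the query from the question with distribute by batch_id appended, and the second pair shows one way to load a skewed key separately from the rest. The table employee_src (a source that already carries a batch_id column) and the choice of 123456789 as the skewed key are assumptions made only for illustration; they are not part of the question.

```sql
-- Dynamic-partition inserts need these settings when no partition value is static.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- 1) The query from the question, with DISTRIBUTE BY added.
INSERT OVERWRITE TABLE employee_stage
PARTITION (batch_id)
SELECT
  id,
  first_name,
  latst_name,
  email,
  gender,
  123456789 AS batch_id
FROM employee
DISTRIBUTE BY batch_id;

-- 2) Loading a skewed key separately.
--    employee_src is a HYPOTHETICAL source table with a batch_id column,
--    and 123456789 is assumed to be the skewed key.

-- 2a) The skewed key goes into a static partition in its own statement.
INSERT OVERWRITE TABLE employee_stage
PARTITION (batch_id = 123456789)
SELECT
  id,
  first_name,
  latst_name,
  email,
  gender
FROM employee_src
WHERE batch_id = 123456789;

-- 2b) Every other key is loaded with dynamic partitioning.
--     Rows with a NULL batch_id are excluded by this predicate
--     and would need their own handling.
INSERT OVERWRITE TABLE employee_stage
PARTITION (batch_id)
SELECT
  id,
  first_name,
  latst_name,
  email,
  gender,
  batch_id
FROM employee_src
WHERE batch_id <> 123456789
DISTRIBUTE BY batch_id;
```

Keep in mind that distribute by assigns rows to reducers by hashing the given expression, so when every row has the same batch_id, as in the question's query, all rows still map to one reducer. The clause matters most when several batch_id values are loaded in one statement.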
