Reputation: 61
I want to insert into a partitioned Hive table from another Hive table. All the data goes into a single partition of the target table. The problem is that all the reducers complete very quickly except one, which takes a long time because all of the work goes to that single reducer.
I want to find a way to distribute the work equally among all the reducers. Is there any way to do so? How can I improve the performance of the insert overwrite?
Source table DDL:
CREATE EXTERNAL TABLE employee (id INT, first_name String, latst_name String, email String, gender String) STORED AS TEXTFILE LOCATION '/emp/data'
Target table DDL:
CREATE EXTERNAL TABLE employee_stage (id INT, first_name String, latst_name String, email String, gender String) PARTITIONED BY (batch_id BIGINT) STORED AS ORC LOCATION '/stage/emp/data'
Here is the data snapshot
1 Helen Perrie [email protected] Female
2 Rafaelita Jancso [email protected] Female
3 Letti Kelley [email protected] Female
4 Adela Dmisek [email protected] Female
5 Lay Reyner [email protected] Male
6 Robby Felder [email protected] Male
7 Thayne Brunton [email protected] Male
8 Lorrie Roony [email protected] Male
9 Hodge Straun [email protected] Male
10 Gawain Tomblett [email protected] Male
11 Carey Facher [email protected] Male
12 Pamelina Elijahu [email protected] Female
13 Carmelle Dabs [email protected] Female
14 Moore Baldrick [email protected] Male
15 Sheff Morin [email protected] Male
16 Zed Eary [email protected] Male
17 Angus Pollastrone [email protected] Male
18 Moises Hubach [email protected] Male
19 Lilllie Beetham [email protected] Female
20 Mortimer De Hooge [email protected] Male
The source table contains more than 100 million records.
Here is the HQL I am using.
insert overwrite table employee_stage
PARTITION (batch_id)
SELECT
id,
first_name,
latst_name,
email,
gender,
123456789 as batch_id
FROM employee;
The data goes into a single partition.
Please let me know how I can improve the performance in this situation. Is there any way to distribute the rows equally among all the reducers?
Upvotes: 2
Views: 2009
Reputation: 38335
I suppose you are not doing joins or other heavy transformations in your insert overwrite query, and the skew really happens during the insert itself; if you were, the question would be about those transformations rather than about the insert.
Try adding distribute by batch_id to your insert query and re-run. If it still runs with skew, check your data: there may be too much data for some particular batch_id, or you may have a lot of NULLs. There are different approaches to dealing with skewed data; one of them is to filter out the skewed keys and load them separately. Check the long-running reducer's logs on the job tracker; they will give you more information about where the problem is.
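For reference, here is a minimal sketch of that suggestion, reusing the exact query from the question; only the DISTRIBUTE BY clause at the end is new:
-- same insert as in the question, with DISTRIBUTE BY added on the dynamic partition column
insert overwrite table employee_stage
PARTITION (batch_id)
SELECT
id,
first_name,
latst_name,
email,
gender,
123456789 as batch_id
FROM employee
DISTRIBUTE BY batch_id;
Distributing by the dynamic partition column sends all rows for a given batch_id to the same reducer, so each reducer writes only to the partitions it actually receives. If one batch_id carries most of the data, that reducer will still be the slow one, which is why the next step is to inspect the data itself.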
Upvotes: 2