Reputation: 553
I have a very large table, CLAIMS, with the following columns:
p_key
c_key
claim_type
Each row is uniquely defined by p_key, c_key. Often there will be multiple c_keys for each p_key. The table would look like this:
p_key   c_key   claim_type
1       1       A
1       2       A
2       3       B
2       5       C
3       1       B
I want to find the minimum c_key for each p_key. This is my query:
SELECT p_key,
       MIN(c_key) AS min_ckey
FROM CLAIMS
GROUP BY p_key
The issue is that when I run this as a MapReduce job through the Hive CLI (0.13), the reduce portion takes 30 minutes to even reach 5%. I'm not entirely sure what could cause such a simple query to take so long. This query runs into the same issue:
SELECT p_key,
       ROW_NUMBER() OVER (PARTITION BY p_key ORDER BY c_key) AS RowNum
FROM CLAIMS
So my question is: why would the reduce portion of a seemingly simple MapReduce job take so long? Any suggestions on how to investigate or improve the query would also be appreciated.
Upvotes: 0
Views: 44
Reputation: 3897
The reduce phase runs in three stages: reported progress up to 33% is the shuffle, 33% to 66% is the sort, and from 67% onward is the actual reduce.
Your job sounds like it is getting hung up in the shuffle stage. My guess is that your data is spread all over the cluster and this stage is I/O bound, since that is when your rows are copied across the network to the reducers.
You can try bucketing your data:
CREATE TABLE claim_bucket (p_key STRING, c_key STRING, claim_type STRING)
CLUSTERED BY (p_key) INTO 6 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";
You may want more or fewer buckets. This will require some heavy lifting by Hive initially, but it should speed up subsequent queries on the table where p_key is used.
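To actually get data into the bucketed table you also need to load it with bucketing enforced. A minimal sketch, assuming your original table is named CLAIMS as in the question (on Hive 0.13 the hive.enforce.bucketing setting is needed so the insert produces the declared buckets):
SET hive.enforce.bucketing = true;

-- Copy the data into the bucketed table.
INSERT OVERWRITE TABLE claim_bucket
SELECT p_key, c_key, claim_type
FROM CLAIMS;

-- The aggregation then runs against the bucketed copy.
SELECT p_key, MIN(c_key) AS min_ckey
FROM claim_bucket
GROUP BY p_key;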
Of course, you haven't given us much else to go on here. If you post an edit with more information, you might get a better answer. Good luck.
Upvotes: 1
Reputation: 10428
Do you know if the data is imbalanced? If there is one p_key with a very large number of c_key values compared to the average case, then the reducer that deals with that p_key will take a very long time.
Alternatively, is it possible that there are only a small number of p_key values in general? Since you're grouping by p_key, that would limit the number of reducers doing useful work.
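If you want to check, something along these lines should show whether one p_key dominates and how many distinct p_key values there are (just a sketch, using the table and column names from the question):
-- Top p_key values by row count; one huge count relative to the rest points at skew.
SELECT p_key, COUNT(*) AS cnt
FROM CLAIMS
GROUP BY p_key
ORDER BY cnt DESC
LIMIT 10;

-- Number of distinct p_key values; this bounds how many reducers can do useful work.
SELECT COUNT(DISTINCT p_key) FROM CLAIMS;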
Upvotes: 1