Danzo

Reputation: 553

Finding a min value for each key causing efficiency issues

I have a very large table, CLAIMS, with the following columns:

p_key
c_key
claim_type

Each row is uniquely defined by p_key, c_key. Often there will be multiple c_keys for each p_key. The table would look like this:

p_key  c_key  claim_type
1      1      A
1      2      A
2      3      B
2      5      C
3      1      B 

I want to find the minimum c_key for each p_key. This is my query:

SELECT p_key, 
min(c_key) as min_ckey
from CLAIMS
GROUP BY p_key

The issue is that when I run this as a MapReduce job through the Hive CLI (0.13), the reduce portion takes 30 minutes to reach even 5%. I'm not entirely sure what could cause such a simple query to take so long. This query gives the same issue:

SELECT p_key, 
row_number() OVER(PARTITION BY p_key ORDER BY c_key) as RowNum
from CLAIMS

So my question is why would the reduce portion of a seemingly simple mapreduce job take so long? Any suggestions on how to investigate this/improve the query would also be appreciated.

Upvotes: 0

Views: 44

Answers (2)

invoketheshell

Reputation: 3897

The reduce phase occurs in three stages: up to 33% is shuffle, from 33% to 66% is sort, and from 67% onward is the actual reduce.

Your job sounds like it is getting hung up in the shuffle portion of the reduce phase. My guess is that your data is spread all over and this portion is I/O bound: your rows are being moved across the network to the reducers.

You can try bucketing your data:

create table claim_bucket (p_key string, c_key string, claim_type string) 
 clustered by (p_key) into 6 buckets 
 row format delimited fields terminated by ",";

You may want more or fewer buckets, and this will require some heavy lifting by Hive initially, but it should speed up subsequent queries of the table where p_key is used.
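A sketch of how you might populate the bucketed table (hive.enforce.bucketing exists in Hive 0.13 and makes Hive use one reducer per bucket; the table and column names follow the question):

set hive.enforce.bucketing = true;

insert overwrite table claim_bucket
select p_key, c_key, claim_type
from CLAIMS;

After this, the min(c_key) query reads data already partitioned on p_key, which should cut down the shuffle.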

Of course you haven't left much else to go on here. If you post an edit and give more information you might get a better answer. Good luck.

Upvotes: 1

mattinbits

Reputation: 10428

Do you know if the data is imbalanced? If there is one p_key with a very large number of c_key values compared to the average case, then the reducer which deals with that p_key will take a very long time.

Alternatively, is it possible that there are a small number of p_key values in general? Since you're grouping by p_key that would limit the number of reducers doing useful work.
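As a quick diagnostic (just a sketch, using the column names from the question), you can count the c_keys per p_key and look at the biggest groups:

SELECT p_key, count(*) as cnt
from CLAIMS
GROUP BY p_key
ORDER BY cnt DESC
LIMIT 10;

If the top counts dwarf the rest, a single hot p_key is pinning one reducer; if there are only a handful of distinct p_key values, most reducers will sit idle.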

Upvotes: 1
