get sum of the each possibility based on the column value in hive - Aggregate table

Question

I have table with following columns.

For the above table I need to get the count by date for each cd depends on ind value combinations and expecting the below output table.

for the row2 in output table, one OK and one no is there for id 45 and so need to take count as 1 for the date 2020-02-24 because it has 1 ok

similarly, for row4, it has notok and no, so for this combination we need to take notok for the max date for the id 30

I need to develop this in hive and could someone suggest how we can implement this. I tried separate sub queries to write but it is hitting performance due to many joins(I am writing individual query to calculate each combination separately and joining the results)

Updated for other scenario:

I have below data in table.

When we give weightage, it looks as follows

First case: when we are doing group by date, for 1/1/2020 I am getting count 1 which is correct

2nd Case: for date 1/2/2020, we suppose to get only count 1 for notOk but it is giving 2(as it is looking for 1st case row for 1/2/2020 for cd 1.

Also another scienario:

when I have multiple records for the same cd in different dates, not giving right results.

I have 2 "ok"s for cd 1 in different dates. So We need to consider only count 1 and we need to drop other ok which is either 1/1/2020 or 1/2/2020 as it is for same cd.

Really appreciate for your help.

Thanks, Babu

Rajat Gupta · Accepted Answer

If you need to take ind count for latest date for a given ID then the query will look like following

select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count 
from mytable_test where dt in (select max(dt) from mytable_test group by cd) group by dt;

However, If there are certain truth table conditions such as : for a given ID,
-choose OK if it has both OK and No. -choose not ok if it has both No and not ok.

then it might not be a very efficient one, but will work fine.

select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count 
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from  (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind)  group by dt;

explaination

first provide some weightage to the conditions of ind that you have provided . In this case, based on your example, I am assuming NOK will be of least weigtage and OK medium and not ok of highest

select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test

    +-------------+-----+---------+---------+--+
    |     dt      | cd  |   ind   | ind_wt  |
    +-------------+-----+---------+---------+--+
    | 2020-08-24  | 10  | ok      | 1       |
    | 2020-02-21  | 45  | No      | 0       |
    | 2020-02-24  | 45  | ok      | 1       |
    | 2020-08-25  | 20  | No      | 0       |
    | 2020-10-09  | 30  | not ok  | 2       |
    | 2020-10-13  | 30  | not ok  | 2       |
    | 2020-10-21  | 30  | No      | 0       |
    | 2020-10-23  | 30  | No      | 0       |
    | 2020-09-14  | 12  | No      | 0       |
    +-------------+-----+---------+---------+--+

next get the max weightage for each CD ( in the wt block)

select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd

+-----+------------------+--+
| cd  | ind_to_consider  |
+-----+------------------+--+
| 10  | 1                |
| 12  | 0                |
| 20  | 0                |
| 30  | 2                |
| 45  | 1                |
+-----+------------------+--+

Now that you have to decode the weight back to indicator so that you can get the latest date for each cd and max indicator .

select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from  (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind

+-------------+--+
|     _c0     |
+-------------+--+
| 2020-08-24  |
| 2020-09-14  |
| 2020-08-25  |
| 2020-10-13  |
| 2020-02-24  |
+-------------+--+

then use these dates to get the piviot

select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count 
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from  (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind)  group by dt;



+-------------+-----------+-----------+---------------+--+
|     dt      | ok_count  | no_count  | not_ok_count  |
+-------------+-----------+-----------+---------------+--+
| 2020-02-24  | 1         | 0         | 0             |
| 2020-08-24  | 1         | 0         | 0             |
| 2020-08-25  | 0         | 1         | 0             |
| 2020-09-14  | 0         | 1         | 0             |
| 2020-10-13  | 0         | 0         | 1             |
+-------------+-----------+-----------+---------------+--+

get sum of the each possibility based on the column value in hive - Aggregate table

Answers (2)

Related Questions