Babu
Babu

Reputation: 169

get sum of the each possibility based on the column value in hive - Aggregate table

I have table with following columns.

enter image description here

For the above table I need to get the count by date for each cd depends on ind value combinations and expecting the below output table.

enter image description here

for the row2 in output table, one OK and one no is there for id 45 and so need to take count as 1 for the date 2020-02-24 because it has 1 ok

similarly, for row4, it has notok and no, so for this combination we need to take notok for the max date for the id 30

I need to develop this in hive and could someone suggest how we can implement this. I tried separate sub queries to write but it is hitting performance due to many joins(I am writing individual query to calculate each combination separately and joining the results)

Updated for other scenario:

I have below data in table.

enter image description here

When we give weightage, it looks as follows

enter image description here

First case: when we are doing group by date, for 1/1/2020 I am getting count 1 which is correct

2nd Case: for date 1/2/2020, we suppose to get only count 1 for notOk but it is giving 2(as it is looking for 1st case row for 1/2/2020 for cd 1.

Also another scienario:

when I have multiple records for the same cd in different dates, not giving right results.

enter image description here

I have 2 "ok"s for cd 1 in different dates. So We need to consider only count 1 and we need to drop other ok which is either 1/1/2020 or 1/2/2020 as it is for same cd.

Really appreciate for your help.

Thanks, Babu

Upvotes: 0

Views: 900

Answers (2)

Rajat Gupta
Rajat Gupta

Reputation: 26

If you need to take ind count for latest date for a given ID then the query will look like following

select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count 
from mytable_test where dt in (select max(dt) from mytable_test group by cd) group by dt;

However, If there are certain truth table conditions such as : for a given ID,
-choose OK if it has both OK and No. -choose not ok if it has both No and not ok.

then it might not be a very efficient one, but will work fine.

select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count 
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from  (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind)  group by dt;

explaination

first provide some weightage to the conditions of ind that you have provided . In this case, based on your example, I am assuming NOK will be of least weigtage and OK medium and not ok of highest

select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test

    +-------------+-----+---------+---------+--+
    |     dt      | cd  |   ind   | ind_wt  |
    +-------------+-----+---------+---------+--+
    | 2020-08-24  | 10  | ok      | 1       |
    | 2020-02-21  | 45  | No      | 0       |
    | 2020-02-24  | 45  | ok      | 1       |
    | 2020-08-25  | 20  | No      | 0       |
    | 2020-10-09  | 30  | not ok  | 2       |
    | 2020-10-13  | 30  | not ok  | 2       |
    | 2020-10-21  | 30  | No      | 0       |
    | 2020-10-23  | 30  | No      | 0       |
    | 2020-09-14  | 12  | No      | 0       |
    +-------------+-----+---------+---------+--+

next get the max weightage for each CD ( in the wt block)

select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd

+-----+------------------+--+
| cd  | ind_to_consider  |
+-----+------------------+--+
| 10  | 1                |
| 12  | 0                |
| 20  | 0                |
| 30  | 2                |
| 45  | 1                |
+-----+------------------+--+

Now that you have to decode the weight back to indicator so that you can get the latest date for each cd and max indicator .

select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from  (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind

+-------------+--+
|     _c0     |
+-------------+--+
| 2020-08-24  |
| 2020-09-14  |
| 2020-08-25  |
| 2020-10-13  |
| 2020-02-24  |
+-------------+--+

then use these dates to get the piviot

select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count 
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from  (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from  mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind)  group by dt;



+-------------+-----------+-----------+---------------+--+
|     dt      | ok_count  | no_count  | not_ok_count  |
+-------------+-----------+-----------+---------------+--+
| 2020-02-24  | 1         | 0         | 0             |
| 2020-08-24  | 1         | 0         | 0             |
| 2020-08-25  | 0         | 1         | 0             |
| 2020-09-14  | 0         | 1         | 0             |
| 2020-10-13  | 0         | 0         | 1             |
+-------------+-----------+-----------+---------------+--+

Upvotes: 1

GMB
GMB

Reputation: 222432

Use conditional aggregation:

select date,
    sum(case when ind = 'ok'     then 1 else 0 end) ok_count,
    sum(case when ind = 'No'     then 1 else 0 end) no_count,
    sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from mytable
group by date

Or, if you want to only take in account the latest row per id, we can pre-filter with row_number() first:

select date,
    sum(case when ind = 'ok'     then 1 else 0 end) ok_count,
    sum(case when ind = 'No'     then 1 else 0 end) no_count,
    sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from (
    select t.*, row_number() over(partition by id order by date desc) rn
    from mytable t
) t
where rn = 1
group by date

Upvotes: 0

Related Questions