Reputation: 169
I have table with following columns.
For the above table I need to get the count by date for each cd depends on ind value combinations and expecting the below output table.
for the row2 in output table, one OK and one no is there for id 45 and so need to take count as 1 for the date 2020-02-24 because it has 1 ok
similarly, for row4, it has notok and no, so for this combination we need to take notok for the max date for the id 30
I need to develop this in hive and could someone suggest how we can implement this. I tried separate sub queries to write but it is hitting performance due to many joins(I am writing individual query to calculate each combination separately and joining the results)
Updated for other scenario:
I have below data in table.
When we give weightage, it looks as follows
First case: when we are doing group by date, for 1/1/2020 I am getting count 1 which is correct
2nd Case: for date 1/2/2020, we suppose to get only count 1 for notOk but it is giving 2(as it is looking for 1st case row for 1/2/2020 for cd 1.
Also another scienario:
when I have multiple records for the same cd in different dates, not giving right results.
I have 2 "ok"s for cd 1 in different dates. So We need to consider only count 1 and we need to drop other ok which is either 1/1/2020 or 1/2/2020 as it is for same cd.
Really appreciate for your help.
Thanks, Babu
Upvotes: 0
Views: 900
Reputation: 26
If you need to take ind count for latest date for a given ID then the query will look like following
select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count
from mytable_test where dt in (select max(dt) from mytable_test group by cd) group by dt;
However, If there are certain truth table conditions such as : for a given ID,
-choose OK if it has both OK and No.
-choose not ok if it has both No and not ok.
then it might not be a very efficient one, but will work fine.
select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind) group by dt;
explaination
first provide some weightage to the conditions of ind that you have provided . In this case, based on your example, I am assuming NOK will be of least weigtage and OK medium and not ok of highest
select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test
+-------------+-----+---------+---------+--+
| dt | cd | ind | ind_wt |
+-------------+-----+---------+---------+--+
| 2020-08-24 | 10 | ok | 1 |
| 2020-02-21 | 45 | No | 0 |
| 2020-02-24 | 45 | ok | 1 |
| 2020-08-25 | 20 | No | 0 |
| 2020-10-09 | 30 | not ok | 2 |
| 2020-10-13 | 30 | not ok | 2 |
| 2020-10-21 | 30 | No | 0 |
| 2020-10-23 | 30 | No | 0 |
| 2020-09-14 | 12 | No | 0 |
+-------------+-----+---------+---------+--+
next get the max weightage for each CD ( in the wt block)
select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd
+-----+------------------+--+
| cd | ind_to_consider |
+-----+------------------+--+
| 10 | 1 |
| 12 | 0 |
| 20 | 0 |
| 30 | 2 |
| 45 | 1 |
+-----+------------------+--+
Now that you have to decode the weight back to indicator so that you can get the latest date for each cd and max indicator .
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind
+-------------+--+
| _c0 |
+-------------+--+
| 2020-08-24 |
| 2020-09-14 |
| 2020-08-25 |
| 2020-10-13 |
| 2020-02-24 |
+-------------+--+
then use these dates to get the piviot
select dt,count(case when ind='ok' then 1 end) as ok_count,
count(case when ind='No' then 1 end) as No_count,
count(case when ind='not ok' then 1 end) as not_ok_count
from mytable_test where dt in (
select max(a.dt) from mytable_test a,(select cd, (case when ind_to_consider=0 then 'No' when ind_to_consider=1 then 'ok' when ind_to_consider=2 then 'not ok' end ) as decoeded_ind from (select cd,max(ind_wt) as ind_to_consider from (select dt,cd,ind,(case when ind='ok' then 1 when ind='No' then 0 when ind='not ok' then 2 end ) as ind_wt from mytable_test) wt group by cd) decoder) k where a.cd=k.cd and a.ind=k.decoeded_ind group by a.cd,a.ind) group by dt;
+-------------+-----------+-----------+---------------+--+
| dt | ok_count | no_count | not_ok_count |
+-------------+-----------+-----------+---------------+--+
| 2020-02-24 | 1 | 0 | 0 |
| 2020-08-24 | 1 | 0 | 0 |
| 2020-08-25 | 0 | 1 | 0 |
| 2020-09-14 | 0 | 1 | 0 |
| 2020-10-13 | 0 | 0 | 1 |
+-------------+-----------+-----------+---------------+--+
Upvotes: 1
Reputation: 222432
Use conditional aggregation:
select date,
sum(case when ind = 'ok' then 1 else 0 end) ok_count,
sum(case when ind = 'No' then 1 else 0 end) no_count,
sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from mytable
group by date
Or, if you want to only take in account the latest row per id
, we can pre-filter with row_number()
first:
select date,
sum(case when ind = 'ok' then 1 else 0 end) ok_count,
sum(case when ind = 'No' then 1 else 0 end) no_count,
sum(case when ind = 'not ok' then 1 else 0 end) not_ok_count
from (
select t.*, row_number() over(partition by id order by date desc) rn
from mytable t
) t
where rn = 1
group by date
Upvotes: 0