Reputation: 60064
Suppose I have a table describing workers' contribution to projects
project worker contribution
1 1 2
1 2 3
2 1 4
To compute the worker impact, I can do
select t.project, t.worker, t.contribution, p.total,
t.contribution / p.total as relative
from my_table t
join (select project, sum(contribution) as total
from my_table group by project) p
on t.project = p.project
to get new_table
project worker contribution total relative
1 1 1 5 .2
1 2 4 5 .8
2 1 4 4 1
if I now compute the average relative contribution using
select worker, avg(relative) as avg_rel
from new_table group by worker
I will see
worker avg_rel
1 .6
2 .8
which disregards the 0 contribution of worker 2
to project 2
.
How do I take that into account? I.e., I want to get
worker avg_rel
1 .6
2 .4
as if the original table also included the line
2 2 0
Thanks.
Upvotes: 0
Views: 78
Reputation: 157108
Couldn't you use a left outer join like this? Maybe join this with a list of workers?
select t.project, t.worker, t.contribution, p.total,
case when coalesce(p.total, 0) = 0 then 0 else t.contribution / p.total end as relative
from my_table t
full
outer
join ( select distinct worker
from my_table
) w
on t.worker = w.worker
join (select project, sum(contribution) as total
from my_table group by project) p
on t.project = p.project
Upvotes: 1
Reputation: 1270883
I'm not sure this will work in Hive, but here is a SQL solution:
select w.worker, avg(coalesce(t.relative, 0.0)) as avg_rel
from (select distinct project from my_table) p cross join
(select distinct worker from my_table) w left outer join
my_table t
on t.project = p.project and t.worker = w.worker
group by w.worker;
Upvotes: 1