sds
sds

Reputation: 60064

Treat missing rows as having 0 data

Suppose I have a table describing workers' contribution to projects

project worker contribution
1       1      2
1       2      3
2       1      4

To compute the worker impact, I can do

select t.project, t.worker, t.contribution, p.total, 
       t.contribution / p.total as relative 
from my_table t
join (select project, sum(contribution) as total
      from my_table group by project) p
on t.project = p.project

to get new_table

project worker contribution total relative
1       1      1            5     .2
1       2      4            5     .8
2       1      4            4     1

if I now compute the average relative contribution using

select worker, avg(relative) as avg_rel
from new_table group by worker

I will see

worker avg_rel
1      .6
2      .8

which disregards the 0 contribution of worker 2 to project 2.

How do I take that into account? I.e., I want to get

worker avg_rel
1      .6
2      .4

as if the original table also included the line

2       2      0

Thanks.

Upvotes: 0

Views: 78

Answers (2)

Patrick Hofman
Patrick Hofman

Reputation: 157108

Couldn't you use a left outer join like this? Maybe join this with a list of workers?

select t.project, t.worker, t.contribution, p.total, 
   case when coalesce(p.total, 0) = 0 then 0 else t.contribution / p.total end as relative 
from my_table t
full
outer
join ( select distinct worker
         from my_table
       ) w
on t.worker = w.worker
join (select project, sum(contribution) as total
  from my_table group by project) p
on t.project = p.project

Upvotes: 1

Gordon Linoff
Gordon Linoff

Reputation: 1270883

I'm not sure this will work in Hive, but here is a SQL solution:

select w.worker, avg(coalesce(t.relative, 0.0)) as avg_rel
from (select distinct project from my_table) p cross join
     (select distinct worker from my_table) w left outer join
     my_table t
     on t.project = p.project and t.worker = w.worker
group by w.worker;

Upvotes: 1

Related Questions