madsthaks
madsthaks

Reputation: 407

Looking to fill missing time periods with zero for each ID (Athena SQL)

In my data, I'm looking at quarterly counts per user. For all of the quarters that are missing, I'd like to impute them with 0. Here is an example:

Here is my current dataset:

ID Qtr    Count
1  2018Q1 1
1  2018Q3 1
1  2018Q4 2
2  2018Q1 4
2  2018Q2 6
2  2019Q3 2
2  2019Q4 1
3  2018Q1 8
3  2018Q2 5
3  2018Q3 2
3  2018Q4 4
3  2019Q1 5

Updated table:

ID Qtr    Count
1  2018Q1 1
1  2018Q2 0
1  2018Q3 1
1  2018Q4 2
1  2019Q1 0
1  2019Q2 0
1  2019Q3 0
1  2019Q4 0
2  2018Q1 4
2  2018Q2 6
2  2018Q3 0
2  2018Q4 0
2  2019Q1 0
2  2019Q2 0
2  2019Q3 2
2  2019Q4 1
3  2018Q1 8
3  2018Q2 5
3  2018Q3 2
3  2018Q4 4
3  2019Q1 5
3  2019Q2 0
3  2019Q3 0
3  2019Q4 0

Any help would be much appreciated.

Upvotes: 2

Views: 1273

Answers (3)

OjtamOjtam
OjtamOjtam

Reputation: 280

You can create table containing all your quarters (in example quarters_) and then join data (in example some_data) to it.

create table some_data (year number, quarter number, data varchar2(10));
insert into some_data values(2018,3,'AAAA');

create table  quarters_ (signature varchar2(20), year number, quarter_number number);
insert into quarters_ (signature, year, quarter_number) values ('2018Q1', 2018, 1);
insert into quarters_ (signature, year, quarter_number) values ('2018Q2', 2018, 2);
insert into quarters_ (signature, year, quarter_number) values ('2018Q3', 2018, 3);
insert into quarters_ (signature, year, quarter_number) values ('2018Q4', 2018, 4);
insert into quarters_ (signature, year, quarter_number) values ('2019Q1', 2019, 1);
insert into quarters_ (signature, year, quarter_number) values ('2019Q2', 2019, 2);
insert into quarters_ (signature, year, quarter_number) values ('2019Q3', 2019, 3);
insert into quarters_ (signature, year, quarter_number) values ('2019Q4', 2019, 4);

select signature as Qtr, count (s.data) as count
from quarters_ q
left join some_data s on s.quarter = q.quarter_number and s.year = q.year
group by q.signature

Upvotes: 0

Ravi Joshi
Ravi Joshi

Reputation: 661

So, to solve this we will first all the year+quarter values for a particular ID and then we can calculate the count for each year+quarter. So the solution is divided into two steps as follows:

  1. Create a dataset to get all the year+quarter values per ID.
  2. CROSS JOIN it with original data to get all the result.

1. CREATE A DATASET:

To create dataset the steps will be:

  1. Get unique ID,year combination
  2. Convert it year data to year+quarter data.
  3. Seperate out it array entry.

Get unique ID,year combination:

SELECT ID, ARRAY[substr(Qtr, 1, 4)] AS year from table_name 
GROUP BY ID, ARRAY[substr(Qtr, 1, 4)]

Convert it year data to year+quarter data:

WITH dataset AS (
SELECT ID, ARRAY[substr(Qtr, 1, 4)] AS year from table_name 
GROUP BY ID, ARRAY[substr(Qtr, 1, 4)]
)
select ID, transform(year, x->x||'Q1') || transform(year, x->x||'Q2') || transform(year, x->x||'Q3') ||
transform(year, x->x||'Q4') 
AS year from dataset

Seperate out it array entry:

WITH dataset AS (
    WITH inner_dataset AS (
      SELECT ID, ARRAY[substr(Qtr, 1, 4)] AS year from table_name 
      GROUP BY ID, ARRAY[substr(Qtr, 1, 4)]
     )
     select ID, transform(year, x->x||'Q1') || transform(year, x->x||'Q2') || 
     transform(year, x->x||'Q3') || transform(year, x->x||'Q4') 
     AS year from inner_dataset
)
SELECT ID,yr from dataset
CROSS JOIN UNNEST(year) AS t(yr)

Output of this query will be the ROWS of ID,year+quarter having all the quarters for each id. Now the next step will be to convert it into final dataset and then JOIN this data with original table.

2. CROSS JOIN it with original data to get all the result.

SELECT table_name.ID,final_dataset.yr,coalesce(table_name.Count,0)
FROM table_name
CROSS JOIN final_dataset
ON table_name.ID = final_dataset.ID AND table_name.Qtr = final_dataset.yr
ORDER BY table_name.ID,final_dataset.yr

coalesce() is used to convert NULL values to 0.

FINAL QUERY:

WITH final_dataset AS (
  WITH dataset AS (
      WITH inner_dataset AS (
        SELECT ID, ARRAY[substr(Qtr, 1, 4)] AS year from table_name 
        GROUP BY ID, ARRAY[substr(Qtr, 1, 4)]
       )
       select ID, transform(year, x->x||'Q1') || transform(year, x->x||'Q2') || 
       transform(year, x->x||'Q3') || transform(year, x->x||'Q4') 
       AS year from inner_dataset
  )
  SELECT ID,yr from dataset
  CROSS JOIN UNNEST(year) AS t(yr)
)
SELECT table_name.ID,final_dataset.yr,coalesce(table_name.Count,0)
FROM table_name
CROSS JOIN final_dataset
ON table_name.ID = final_dataset.ID AND table_name.Qtr = final_dataset.yr
ORDER BY table_name.ID,final_dataset.yr

Upvotes: 1

Piotr Findeisen
Piotr Findeisen

Reputation: 20770

You need to produce a table with zeroes. For this sequence() and UNNEST come particularily useful.

Here is an example where your data has only quarter and count (no id).

presto:default> SELECT coalesce(t.q, a.q), coalesce(t.c, 0)
             -> FROM (VALUES ('2018Q1', 1)) t(q, c)
             -> RIGHT OUTER JOIN (
             ->     SELECT CAST(y AS varchar) || 'Q' || CAST(x AS varchar) AS q
             ->     FROM UNNEST(sequence(2018, 2019)) _(y)
             ->     CROSS JOIN UNNEST(sequence(1, 4)) _(x)
             -> ) a ON t.q = a.q
             -> ;
 _col0  | _col1
--------+-------
 2018Q1 |     1
 2018Q3 |     0
 2019Q1 |     0
 2018Q4 |     0
 2019Q3 |     0
 2019Q2 |     0
 2018Q2 |     0
 2019Q4 |     0
(8 rows)

(Tested on Presto 322)

You can generalize it easily for id:

presto:default> WITH source_table AS (
             ->     SELECT *
             ->     FROM (VALUES (1, '2018Q1', 13), (2, '2018Q2', 42)) t(id, q, c)
             -> )
             -> SELECT id, q, coalesce(c, 0)
             -> FROM source_table
             -> RIGHT OUTER JOIN (
             ->     SELECT id, CAST(y AS varchar) || 'Q' || CAST(x AS varchar) AS q
             ->     FROM (SELECT DISTINCT id FROM source_table)
             ->     CROSS JOIN UNNEST(sequence(2018, 2019)) _(y)
             ->     CROSS JOIN UNNEST(sequence(1, 4)) _(x)
             -> ) a USING(id, q)
             -> ;
 id |   q    | _col2
----+--------+-------
  1 | 2018Q1 |    13
  2 | 2018Q2 |    42
  2 | 2019Q3 |     0
  1 | 2018Q2 |     0
  1 | 2019Q3 |     0
  2 | 2018Q4 |     0
  1 | 2019Q2 |     0
  1 | 2019Q4 |     0
  2 | 2019Q4 |     0
  2 | 2018Q3 |     0
  2 | 2019Q2 |     0
  1 | 2018Q3 |     0
  1 | 2019Q1 |     0
  2 | 2019Q1 |     0
  2 | 2018Q1 |     0
  1 | 2018Q4 |     0
(16 rows)

Upvotes: 0

Related Questions