Hemant
Hemant

Reputation: 94

Carry forward a row if missing for a date | Bigquery

I have data this way

Column1 Column2  Date
d1        b1    1/1/2020
d1        b2    1/1/2020
d1        b3    1/1/2020
d2        b2    1/2/2020
d1        b3    1/2/2020
d1        b4    1/2/2020
d1        b1    1/3/2020
d2        b3    1/3/2020
d2        b3    1/4/2020

The grain of the above data is Col1, Col2 For all this grain, if the data at grain level does not come on next day, I need to repeat from previous day. Meaning if I run the query toady and today is 1/4/2020, the expected output is as follows

Column1 Column2  Date     Status
d1        b1    1/1/2020  FromInput
d1        b1    1/2/2020  Repeated
d1        b1    1/3/2020  FromInput
d1        b1    1/4/2020  FromInput

d1        b2    1/1/2020  FromInput
d1        b2    1/2/2020  Repeated
d1        b2    1/3/2020  Repeated
d1        b2    1/4/2020  Repeated

d1        b3    1/1/2020  FromInput
d1        b3    1/2/2020  FromInput
d1        b3    1/3/2020  Repeated
d1        b3    1/4/2020  Repeated

d1        b4    1/2/2020  FromInput
d1        b4    1/3/2020  Repeated
d1        b4    1/4/2020  Repeated

d2        b2    1/2/2020  FromInput
d2        b2    1/3/2020  Repeated 
d2        b2    1/4/2020  Repeated

d2        b3    1/3/2020  FromInput
d2        b3    1/4/2020  FromInput

Upvotes: 0

Views: 141

Answers (2)

GMB
GMB

Reputation: 222702

You can simply cross join the distinct values of col1, col2 and date to generate all possible combinations, and then bring the original table with a left join. The rest is just a little conditional logic in the select clause to compute the status:

select
    c1.col1,
    c2.col2,
    d.date,
    case when t.col1 is null then 'Repeated' else 'From Input' end status
from (select distinct col1 from mytable) c1
cross join (select distinct col2 from mytable) c2
cross join (select distinct date from mytable) d
left join mytable t
    on  t.col1 = c1.col1
    and t.col2 = c2.col2
    and t.date = c3.date

Upvotes: 0

Mikhail Berlyant
Mikhail Berlyant

Reputation: 173190

Below is for BigQuery Standard SQL

#standardSQL
WITH days AS (
  SELECT column1, column2, day FROM (
    SELECT column1, column2, 
      MIN(PARSE_DATE('%m/%d/%Y', day)) min_day,
      PARSE_DATE('%m/%d/%Y', '1/4/2020') max_day -- or just CURRENT_DATE() in real use case
    FROM `project.dataset.table`
    GROUP BY column1, column2
  ),
  UNNEST(GENERATE_DATE_ARRAY(min_day, max_day)) day
)
SELECT d.column1, d.column2, d.day, 
  IFNULL(t.value,
    LAST_VALUE(value IGNORE NULLS) 
      OVER(PARTITION BY d.column1, d.column2 ORDER BY d.day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
  ) value,
  IF(t.value IS NULL, 'Repeated', 'From Input') Status
FROM days d
LEFT JOIN `project.dataset.table` t
ON d.column1 = t.column1
AND d.column2 = t.column2
AND d.day = PARSE_DATE('%m/%d/%Y', t.day)
-- ORDER BY column1, column2, day

You can test, play with above using sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'd1' column1, 'b1' column2, '1/1/2020' day, 1 value UNION ALL
  SELECT 'd1', 'b2', '1/1/2020', 3 UNION ALL
  SELECT 'd1', 'b3', '1/1/2020', 4 UNION ALL
  SELECT 'd2', 'b2', '1/2/2020', 7 UNION ALL
  SELECT 'd1', 'b3', '1/2/2020', 5 UNION ALL
  SELECT 'd1', 'b4', '1/2/2020', 6 UNION ALL
  SELECT 'd1', 'b1', '1/3/2020', 2 UNION ALL
  SELECT 'd2', 'b3', '1/3/2020', 8 UNION ALL
  SELECT 'd2', 'b3', '1/4/2020', 9 
), days AS (
  SELECT column1, column2, day FROM (
    SELECT column1, column2, 
      MIN(PARSE_DATE('%m/%d/%Y', day)) min_day,
      PARSE_DATE('%m/%d/%Y', '1/4/2020') max_day -- or just CURRENT_DATE() in real use case
    FROM `project.dataset.table`
    GROUP BY column1, column2
  ),
  UNNEST(GENERATE_DATE_ARRAY(min_day, max_day)) day
)
SELECT d.column1, d.column2, d.day, 
  IFNULL(t.value,
    LAST_VALUE(value IGNORE NULLS) 
      OVER(PARTITION BY d.column1, d.column2 ORDER BY d.day ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
  ) value,
  IF(t.value IS NULL, 'Repeated', 'From Input') Status
FROM days d
LEFT JOIN `project.dataset.table` t
ON d.column1 = t.column1
AND d.column2 = t.column2
AND d.day = PARSE_DATE('%m/%d/%Y', t.day)
-- ORDER BY column1, column2, day  

with output

Row column1 column2 day         value   Status   
1   d1      b1      2020-01-01  1       From Input   
2   d1      b1      2020-01-02  1       Repeated     
3   d1      b1      2020-01-03  2       From Input   
4   d1      b1      2020-01-04  2       Repeated     

5   d1      b2      2020-01-01  3       From Input   
6   d1      b2      2020-01-02  3       Repeated     
7   d1      b2      2020-01-03  3       Repeated     
8   d1      b2      2020-01-04  3       Repeated    

9   d1      b3      2020-01-01  4       From Input   
10  d1      b3      2020-01-02  5       From Input   
11  d1      b3      2020-01-03  5       Repeated     
12  d1      b3      2020-01-04  5       Repeated     

13  d1      b4      2020-01-02  6       From Input   
14  d1      b4      2020-01-03  6       Repeated     
15  d1      b4      2020-01-04  6       Repeated     

16  d2      b2      2020-01-02  7       From Input   
17  d2      b2      2020-01-03  7       Repeated     
18  d2      b2      2020-01-04  7       Repeated     

19  d2      b3      2020-01-03  8       From Input   
20  d2      b3      2020-01-04  9       From Input   

Upvotes: 2

Related Questions