sqlsql-serverdatetimewindow-functionsgaps-and-islands

Reputation: 129

Fill in gaps in year sequence in SQL Server

I have a table with the columns Age, Period and Year. The column Age always starts with 0 and doesn't have a fixed maximum value (I used 'Age' 0 to 30 in this example but the range could also be 0 to 100 etc.), the values Period and Year only appear in certain rows at certain ages.

However at what Age the values for Period and Year appear, changes and the solution should therefore be dynamic. What is the best way to fill in the NULL values with correct Period and Year?

I am using SQL Server.

Age Period  Year
-----------------
0   NULL    NULL
1   NULL    NULL
2   NULL    NULL
3   NULL    NULL
4   NULL    NULL
5   NULL    NULL
6   NULL    NULL
7   NULL    NULL
8   NULL    NULL
9   NULL    NULL
10  NULL    NULL
11  NULL    NULL
12  NULL    NULL
13  NULL    NULL
14  NULL    NULL
15  NULL    NULL
16  NULL    NULL
17  NULL    NULL
18  NULL    NULL
19  NULL    NULL
20  NULL    NULL
21  46      2065
22  NULL    NULL
23  NULL    NULL
24  NULL    NULL
25  NULL    NULL
26  51      2070
27  NULL    NULL
28  NULL    NULL
29  NULL    NULL
30  NULL    NULL

The result should look like this, the numbers for Period and Year should be increased and/or decrease from the last known values for Period and Year.

Age Period  Year      
-----------------
0   25      2044
1   26      2045
2   27      2046
3   28      2047
4   29      2048
5   30      2049
6   31      2050
7   32      2051
8   33      2052
9   34      2053
10  35      2054
11  36      2055
12  37      2056
13  38      2057
14  39      2058
15  40      2059
16  41      2060
17  42      2061
18  43      2062
19  44      2063
20  45      2064
21  46      2065
22  47      2066
23  48      2067
24  49      2068
25  50      2069
26  51      2070
27  52      2071
28  53      2072
29  54      2073
30  55      2074

Here is an UPDATE to my question as I didn't specify my requirement detailed enough: The solution should be able to handle different combinations of Age, Period and Year. My start point will always be a known Age, Period and Year combination. However, the combination Age = 21, Period = 46 and Year = 2065 (or 26|51|2070 as the second combination) in my example is not static. The value at Age = 21 could be anything e.g. Period = 2 and Year = 2021. Whatever the combination (Age, Period, Year) is, the solution should fill in the gaps and finish the sequence counting up and down from the known values for Period and Year. If a Period value sequence becomes negative the solutions should return NULL values, if possible.

Upvotes: 5

Answers (8)

Wahaj A Khan

Reputation: 111

-- hope you can manage the syntax error. but some logic like given below should work in this case where we can make period an origin to calculate other missing values. good luck!

declare @knownperiod int;
declare @knownperiodage int;
declare @agetop int;
declare @agebottom int;

@knownperiod = select top 1 period from table1 where period is not null
@knownperiodage = select top 1 age from table1 where period is not null



while(@knownperiodage >= 0)
begin 
@knownperiod = @knownperiod -1 ;
@knownperiodage = @knownperiodage -1;
update table1 set period = @knownperiod, year = YEAR(GetDate())+@knownperiod-1  where age = @knownperiodage
end

-- now for bottom age 
@knownperiod = select top 1 period from table1 where period is null or year is null
@knownperiodage = select top 1 age from table1 where period is null or year is null

while(@knownperiodage <= (Select max(age) from table1))
begin 
@knownperiod = @knownperiod +1 ;
@knownperiodage = @knownperiodage +1;
update table1 set period = @knownperiod, year = YEAR(GetDate())+@knownperiod-1  where age = @knownperiodage
end

Upvotes: 1

dnoeth

Reputation: 60472

You finally want three sequences with different start values. Then you simply need to calculate an offset and add it to age:

with cte as
 (
   select age
     ,max(period - age) over () + age as period -- adjusted period
     ,max(yr - age)     over () + age as yr     -- adjusted yr
   from #yourtable
 )
select age
  -- If a Period value sequence becomes negative the solutions should return NULL
  ,case when period >0 then period end as period
  ,yr
from cte

See fiddle

Upvotes: 2

gofr1

Reputation: 15977

Also you can use recursive CTE (it can handle any variation of data in the table except only one that has no populated period and year at all):

WITH cte AS ( -- get any filled period and year
    SELECT TOP 1 period - age delta,
                 [year]-period start_year
    FROM tablename
    WHERE period is not null and [year] is not null
), seq AS ( --get min and max age values
    SELECT MIN(age) as min_age, MAX(age) as max_age
    FROM tablename
), go_recursive AS (
    SELECT min_age age,
           min_age+delta period ,
           start_year+min_age+delta year,
           max_age
    FROM seq
    CROSS JOIN cte --That will generate the initial first row
    UNION ALL
    SELECT age + 1,
           period +1,
           year + 1,
           max_age
    FROM go_recursive 
    WHERE age < max_age --This part increments the data from first row
)
  
SELECT age,
       period,
       [year] 
FROM go_recursive 
OPTION (MAXRECURSION 0) 
-- If you know there are some limit of rows in that kind of tables 
--use this row count instead 0

Upvotes: 0

GMB

Reputation: 222482

This reads like a gaps-and-islands problem, where "empty" rows are the gaps and non-empty rows are the islands.

You want to fill the gaps. Your question is a bit tricky, because you do not clearly describe how to proceed when a gap row has both preceding and following islands - and what to do if they are not consistent.

Let me assume that you want to derive the value from the following island if there is one available, and fall back of the precedng island.

Here is an approach using lateral joins to retrieve the next and preceding non-empty row:

select t.age, 
    coalesce(t.period, n.period - n.diff, p.period - p.diff) period,
    coalesce(t.year,   n.year   - n.diff, p.year   - p.diff) year
from mytable t
outer apply (
    select top (1) t1.*, t1.age - t.age diff
    from mytable t1 
    where t1.age > t.age and t1.period is not null and t1.year is not null
    order by t1.age 
) n
outer apply (
    select top (1) t1.*, t1.age - t.age diff
    from mytable t1 
    where t1.age < t.age and t1.period is not null and t1.year is not null
    order by t1.age desc
) p
order by t.age

Actually, this would probably be more efficiently performed with window functions. We can implement the very same logic by building groups of records with window counts, then doing the computation within the groups:

select
    age,
    coalesce(
        period,
        max(period) over(partition by grp2) - max(age) over(partition by grp2) + age,
        max(period) over(partition by grp1) - min(age) over(partition by grp1) + age
    ) period,
    coalesce(
        year,
        max(year) over(partition by grp2) - max(age) over(partition by grp2) + age,
        max(year) over(partition by grp1) - min(age) over(partition by grp1) + age
    ) year
from (
    select t.*, 
        count(period) over(order by age) grp1, 
        count(period) over(order by age desc) grp2
    from mytable t
) t
order by age

Demo on DB Fiddle - both queries yield:

age | period | year
--: | -----: | ---:
  0 |     25 | 2044
  1 |     26 | 2045
  2 |     27 | 2046
  3 |     28 | 2047
  4 |     29 | 2048
  5 |     30 | 2049
  6 |     31 | 2050
  7 |     32 | 2051
  8 |     33 | 2052
  9 |     34 | 2053
 10 |     35 | 2054
 11 |     36 | 2055
 12 |     37 | 2056
 13 |     38 | 2057
 14 |     39 | 2058
 15 |     40 | 2059
 16 |     41 | 2060
 17 |     42 | 2061
 18 |     43 | 2062
 19 |     44 | 2063
 20 |     45 | 2064
 21 |     46 | 2065
 22 |     47 | 2066
 23 |     48 | 2067
 24 |     49 | 2068
 25 |     50 | 2069
 26 |     51 | 2070
 27 |     52 | 2071
 28 |     53 | 2072
 29 |     54 | 2073
 30 |     55 | 2074

Upvotes: 0

SteveC

Reputation: 6015

This solution takes 4 inputs:

@list_length -- (integer) the number of rows to generate (up to 12^5=248,832)
@start_age -- (integer) beginning age
@start_period -- (integer) beginning period
@start_year -- (integer) beginning year

For any combination of inputs this code generates the requested output. If either the Age or Year is calculated to be negative then it is converted to NULL. The current limit to the list length could be increased to whatever is necessary. The technique of creating a row_number using cross applied rows is known to be very fast when generating large sequences. Above about 500 rows it's always faster than a recursion based CTE. At small row numbers there's little to no performance difference between the two techniques.

Here are the code and output to match the example data.

Inputs

declare
  @list_length            int=31,
  @start_age              int=21,
  @start_period           int=46,
  @start_year             int=2065;

Code

with
n(n) as (select * from (values (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12)) v(n)),
tally_cte(n) as (
     select row_number() over (order by (select null)) 
     from n n1 cross join n n2 cross join n n3 cross join n n4 cross join n n5)
select p.Age, 
       case when p.[Period]<0 then null else p.[Period] end [Period], 
       case when p.[Year]<0 then null else p.[Year] end [Year]
from tally_cte t
     cross apply
     (select (t.n-1) [Age], (t.n-1)+(@start_period-@start_age) [Period],
             (t.n-1)+(@start_year-@start_age) [Year]) p
where n<=@list_length;

Output

Age Period  Year
0   25  2044
1   26  2045
2   27  2046
3   28  2047
4   29  2048
5   30  2049
6   31  2050
7   32  2051
8   33  2052
9   34  2053
10  35  2054
11  36  2055
12  37  2056
13  38  2057
14  39  2058
15  40  2059
16  41  2060
17  42  2061
18  43  2062
19  44  2063
20  45  2064
21  46  2065
22  47  2066
23  48  2067
24  49  2068
25  50  2069
26  51  2070
27  52  2071
28  53  2072
29  54  2073
30  55  2074

Suppose both the Period and the Year are less than the start Age. When the calculated values are negative the value is replaced with a NULL.

Inputs

declare
  @list_length            int=100,
  @start_age              int=10,
  @start_period           int=5,
  @start_year             int=8;

Output

Age Period  Year
0   NULL    NULL
1   NULL    NULL
2   NULL    0
3   NULL    1
4   NULL    2
5   0       3
6   1       4
7   2       5
8   3       6
9   4       7
10  5       8
11  6       9
12  7       10
...
99  94      97

Imo this is a flexible and efficient way to meet all of the requirements. Please let me know if there are any issues.

Upvotes: 0

sandeep rawat

Reputation: 4957

Tabel creation code

create table yourtable ( AGE int ,  Period  int, Year int )

insert into  yourtable
Select  0    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  1    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  2    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  3    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  4    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  5    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  6    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  7    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  8    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  9    AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  10   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  11   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  12   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  13   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  14   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  15   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  16   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  17   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  18   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  19   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  20   AS AGE , null  As Period   ,  null  As Year       UNION all 
Select  21   AS AGE ,46 As Period   ,2065    As Year       UNION all 
Select  22   AS AGE ,  null As Period   ,  null  As Year       UNION all 
Select  23   AS AGE ,  null As Period   ,  null  As Year       UNION all 
Select  24   AS AGE ,  null As Period   ,  null  As Year       UNION all 
Select  25   AS AGE ,   51  As Period   ,2070    As Year       UNION all 
Select  26   AS AGE ,  null As Period   ,  null  As Year       UNION all 
Select  27   AS AGE ,  null As Period   ,  null  As Year       UNION all 
Select  28   AS AGE ,  null As Period   ,  null  As Year       UNION all 
Select  29   AS AGE ,  null As Period   ,  null  As Year       UNION all 
Select  30   AS AGE ,  null As Period   ,  null  As Year

**Steps **

We need to get one row with non null value for Period and year.
Using age get first value for both the column .
Now just add respective age column value and fill full table .

Code to fix the serial

;with tmp as 
  (select top 1 * from yourtable  where  Period is not null and  year is not null)
 update yourtable
  set Period =  (tmp.Period - tmp.age) + yourtable.age
  , year =  (tmp.year - tmp.age) + yourtable.age
  from yourtable , tmp

Declare @age int  ,@Year int  ,@Period int 

select @age = age , @Year = year - (age +1)  ,@Period  = Period- (AGE +1)
from yourtable where  Period is not null and  year is not null

update  yourtable
set Period =@Period + age
,Year =@year + age
 from yourtable

Upvotes: 2

seanb

Reputation: 6685

Is the process to first calculate the increments (age -> period and age -> year) then simply add those increments to the age values? This assumes the differences between age and period, and age and year, are consistent across rows (just not filled in sometimes).

As such, you could use the following to first calculate the increments (PeriodInc, YrInc) and then select the values with the increments added (noting that if period goes negative, it gets NULL).

; WITH  PeriodInc AS (SELECT TOP 1 Period - Age AS PeriodInc FROM #yourtable WHERE Period IS NOT NULL),
        YrInc AS (SELECT TOP 1 Yr - Age AS YrInc FROM #yourtable WHERE Yr IS NOT NULL)
SELECT      Age, 
            CASE WHEN (Age + PeriodInc) >= 0 THEN (Age + PeriodInc) ELSE NULL END AS Period, 
            Age + YrInc AS Yr
    FROM    #yourtable
            CROSS JOIN PeriodInc
            CROSS JOIN YrInc

Here is a DB_Fiddle with the code

Upvotes: 0

ScaisEdge

Reputation: 133370

Seem you have always the same increment for age and year so

select age, isnull(period,age +25) Period,  isnull(year,age+44) year  
from yourtable

or the standard function coalesce (as suggested by Gordon Linoff)

select age, coalesce(period,age +25) Period,  coalesce(year,age+44) year  
from yourtable

Upvotes: 2

Fill in gaps in year sequence in SQL Server

Answers (8)

Related Questions