Duna
Duna

Reputation: 735

Generating Leads and lags for non-consecutive time periods in SAS

With a SAS dataset like

Ob     x    year   pid  grp 
1    3.88   2001    1    a
2    2.88   2002    1    a 
3    0.13   2004    1    a 
4    3.70   2005    1    a 
5    1.30   2007    1    a 
6    0.95   2001    2    b 
7    1.79   2002    2    b 
8    1.59   2004    2    b 
9    1.29   2005    2    b 
10   0.96   2007    2    b 

I would like to get

Ob   x      year    pid grp grp   X_F1      XL1
1   3.88    2001    1   a    a   2.88        .
2   2.88    2002    1   a    a     .        3.88
3   0.13    2004    1   a    a    3.7        .
4   3.7     2005    1   a    a     .        0.13
5   1.3     2007    1   a    a     .         .
6   0.95    2001    2   b    b    1.79       .
7   1.79    2002    2   b    b     .        0.95
8   1.59    2004    2   b    b    1.29       .
9   1.29    2005    2   b    b     .        1.59
10  0.96    2007    2   b    b     .         .

where for observations with the same pid and each year t,

In my data set, not all pids have observations in successive years.

My attempt using the expand proc

proc expand data=have out=want method=none;
    by pid; id year;
    convert x = x_F1 / transformout=(lead 1);
    convert x = x_F2 / transformout=(lead 2);
    convert x = x_F3 / transformout=(lead 3);
    convert x = x_L1 / transformout=(lag 1);
    convert x = x_L2 / transformout=(lag 2);
    convert x = x_L3 / transformout=(lag 3);
run;

did not account for the fact that years are not consecutive.

Upvotes: 1

Views: 502

Answers (3)

Longfish
Longfish

Reputation: 7602

You could stick with proc expand to insert the missing years into your data (utilising the extrapolate statement). I've set the from value to day as this is a sequential integer check for days which will work with your data as YEAR is stored as an integer rather than a date.

Like the other answers, it requires 2 passes of the data, but I don't think there's an alternative to this.

data have;
input x year pid grp $;
datalines;
3.88   2001    1    a
2.88   2002    1    a 
0.13   2004    1    a 
3.70   2005    1    a 
1.30   2007    1    a 
0.95   2001    2    b 
1.79   2002    2    b 
1.59   2004    2    b 
1.29   2005    2    b 
0.96   2007    2    b 
;
run;

proc expand data = have out = have1
    method=none extrapolate
    from=day to=day;
by pid;
id year;
run;

proc expand data=have1 out=want method=none;
    by pid; id year;
    convert x = x_F1 / transformout=(lead 1);
    convert x = x_F2 / transformout=(lead 2);
    convert x = x_F3 / transformout=(lead 3);
    convert x = x_L1 / transformout=(lag 1);
    convert x = x_L2 / transformout=(lag 2);
    convert x = x_L3 / transformout=(lag 3);
run;

or this can be done in one go, subject to whether the value of x is important in the final dataset (see comment below).

proc expand data=have1 out=want1 method=none extrapolate from=day to=day;
    by pid; id year;
    convert x = x_F1 / transformout=(lead 1);
    convert x = x_F2 / transformout=(lead 2);
    convert x = x_F3 / transformout=(lead 3);
    convert x = x_L1 / transformout=(lag 1);
    convert x = x_L2 / transformout=(lag 2);
    convert x = x_L3 / transformout=(lag 3);
run;

Upvotes: 3

SRSwift
SRSwift

Reputation: 1710

Here is a simple approach using proc sql. It joins the data with itself twice; once for the forward and once for the backward lag, then takes the required values where they exist.

proc sql;
    create table want as
    select
        a.*,
        b.x as x_f1,
        c.x as x_l1
    from have as a
    left join have as b
        on a.pid = b.pid and a.year = b.year - 1
    left join have as c
        on a.pid = c.pid and a.year = c.year + 1
    order by 
        a.pid, 
        a.year;
run;

Caveats:

  • It will not expand too well with larger numbers of lags.
  • This is probably not the quickest approach.
  • It requires that there be only one observation for each pid year pair, and would need modifying if this is not the case.

Upvotes: 2

stallingOne
stallingOne

Reputation: 4006

  1. Sort your data per group and per year.
  2. compute x_F1 in a data step with a lag and a condition like this: if (year and lag(year) are consecutive) then x_F1=lag(x)
  3. Sort your date the other way around
  4. Compute x_L1 similarly.

I'm trying to write you a working code right now. If you provide me with a data sample (a data step with an infile e.g.), I can better try and test it.

This seems to work with my data:

/*1*/
proc sort data=WORK.QUERY_FOR_EPILABO_CLEAN_NODUP out=test1(where=(year<>1996)) nodupkey;
    by grp year;
run;

quit;

/*2*/
data test2;
    *retain x;
    set test1;
    by grp;
    x_L1=lag(x);
    if first.grp then
        x_L1=.;
    yeardif=dif(year);
    if (yeardif ne 1) then
        x_L1=.;
run;

/*3*/
proc sort data=test2(drop=yeardif) out=test3;
    by grp descending year;
run;

quit;

/*4*/
data test4;
    *retain x;
    set test3;
    by grp;
    x_F1=lag(x);
    if first.grp then
        x_F1=.;
    yeardif=dif(year);
    if (yeardif ne -1) then
        x_F1=.;
run;

Upvotes: 1

Related Questions