Reputation: 735
With a SAS
dataset like
Ob x year pid grp
1 3.88 2001 1 a
2 2.88 2002 1 a
3 0.13 2004 1 a
4 3.70 2005 1 a
5 1.30 2007 1 a
6 0.95 2001 2 b
7 1.79 2002 2 b
8 1.59 2004 2 b
9 1.29 2005 2 b
10 0.96 2007 2 b
I would like to get
Ob x year pid grp grp X_F1 XL1
1 3.88 2001 1 a a 2.88 .
2 2.88 2002 1 a a . 3.88
3 0.13 2004 1 a a 3.7 .
4 3.7 2005 1 a a . 0.13
5 1.3 2007 1 a a . .
6 0.95 2001 2 b b 1.79 .
7 1.79 2002 2 b b . 0.95
8 1.59 2004 2 b b 1.29 .
9 1.29 2005 2 b b . 1.59
10 0.96 2007 2 b b . .
where for observations with the same pid
and each year t
,
x_F1
is the value of x
in year t+1
and x_L1
is the value of x in year t-1
In my data set, not all pid
s have observations in successive years.
My attempt using the expand proc
proc expand data=have out=want method=none;
by pid; id year;
convert x = x_F1 / transformout=(lead 1);
convert x = x_F2 / transformout=(lead 2);
convert x = x_F3 / transformout=(lead 3);
convert x = x_L1 / transformout=(lag 1);
convert x = x_L2 / transformout=(lag 2);
convert x = x_L3 / transformout=(lag 3);
run;
did not account for the fact that years are not consecutive.
Upvotes: 1
Views: 502
Reputation: 7602
You could stick with proc expand
to insert the missing years into your data (utilising the extrapolate
statement). I've set the from
value to day
as this is a sequential integer check for days which will work with your data as YEAR is stored as an integer rather than a date.
Like the other answers, it requires 2 passes of the data, but I don't think there's an alternative to this.
data have;
input x year pid grp $;
datalines;
3.88 2001 1 a
2.88 2002 1 a
0.13 2004 1 a
3.70 2005 1 a
1.30 2007 1 a
0.95 2001 2 b
1.79 2002 2 b
1.59 2004 2 b
1.29 2005 2 b
0.96 2007 2 b
;
run;
proc expand data = have out = have1
method=none extrapolate
from=day to=day;
by pid;
id year;
run;
proc expand data=have1 out=want method=none;
by pid; id year;
convert x = x_F1 / transformout=(lead 1);
convert x = x_F2 / transformout=(lead 2);
convert x = x_F3 / transformout=(lead 3);
convert x = x_L1 / transformout=(lag 1);
convert x = x_L2 / transformout=(lag 2);
convert x = x_L3 / transformout=(lag 3);
run;
or this can be done in one go, subject to whether the value of x is important in the final dataset (see comment below).
proc expand data=have1 out=want1 method=none extrapolate from=day to=day;
by pid; id year;
convert x = x_F1 / transformout=(lead 1);
convert x = x_F2 / transformout=(lead 2);
convert x = x_F3 / transformout=(lead 3);
convert x = x_L1 / transformout=(lag 1);
convert x = x_L2 / transformout=(lag 2);
convert x = x_L3 / transformout=(lag 3);
run;
Upvotes: 3
Reputation: 1710
Here is a simple approach using proc sql
. It joins the data with itself twice; once for the forward and once for the backward lag, then takes the required values where they exist.
proc sql;
create table want as
select
a.*,
b.x as x_f1,
c.x as x_l1
from have as a
left join have as b
on a.pid = b.pid and a.year = b.year - 1
left join have as c
on a.pid = c.pid and a.year = c.year + 1
order by
a.pid,
a.year;
run;
Caveats:
pid
year
pair, and would need modifying if this is not the case. Upvotes: 2
Reputation: 4006
if (year and lag(year) are consecutive) then x_F1=lag(x)
I'm trying to write you a working code right now. If you provide me with a data sample (a data step with an infile e.g.), I can better try and test it.
This seems to work with my data:
/*1*/
proc sort data=WORK.QUERY_FOR_EPILABO_CLEAN_NODUP out=test1(where=(year<>1996)) nodupkey;
by grp year;
run;
quit;
/*2*/
data test2;
*retain x;
set test1;
by grp;
x_L1=lag(x);
if first.grp then
x_L1=.;
yeardif=dif(year);
if (yeardif ne 1) then
x_L1=.;
run;
/*3*/
proc sort data=test2(drop=yeardif) out=test3;
by grp descending year;
run;
quit;
/*4*/
data test4;
*retain x;
set test3;
by grp;
x_F1=lag(x);
if first.grp then
x_F1=.;
yeardif=dif(year);
if (yeardif ne -1) then
x_F1=.;
run;
Upvotes: 1