Reputation: 413
How can I calculate the moving average of a time series that has breaks using PROC EXPAND? I am looking for a more efficient way to do this calculation because a DATA steps and JOINS on large datasets take a very long time to execute due to server constrains.
Data:
data have(drop=i);
call streaminit(1);
do i = 1 to 20;
period = i;
if i > 10 then period = i + 5;
if i >17 then period = i + 6;
x = round(rand('uniform')*10,1.);
output;
end;
run;
│ period │ x │
├────────┼────┤
│ 1 │ 9 │
│ 2 │ 10 │
│ 3 │ 5 │
│ 4 │ 9 │
│ 5 │ 7 │
│ 11 │ 9 │
│ 12 │ 9 │
│ 13 │ 5 │
│ 15 │ 8 │
│ 16 │ 9 │
Notice that there are two break points in the period variable: 5-11 and 13-15.
Here is the expected result (3 month moving average):
Proc sql;
create table want as
select a.period, a.x
,mean(b.x) as x_avg format=10.2
from have as a
left join have as b
on a.period -3 < b.period <= a.period
group by 1,2;
Quit;
│ period │ x │ x_avg │
├────────┼────┼───────┤
│ 1 │ 9 │ 9.00 │
│ 2 │ 10 │ 9.50 │
│ 3 │ 5 │ 8.00 │
│ 4 │ 9 │ 8.00 │
│ 5 │ 7 │ 7.00 │
│ 11 │ 9 │ 9.00 │
│ 12 │ 9 │ 9.00 │
│ 13 │ 5 │ 7.67 │
│ 15 │ 8 │ 6.50 │
│ 16 │ 9 │ 8.50 │
Upvotes: 0
Views: 759
Reputation: 1394
You can make SQL faster by a little modification.
proc sql noprint;
create table want2 as
select a.period, a.x ,mean(b1.x,b2.x,a.x) as x_avg format=10.2
from have as a
left join have as b1 on a.period -2 = b1.period
left join have as b2 on a.period -1 = b2.period
order by a.period;
quit;
And more faster by data step.
data want3;
set have;
period_l2 = lag2(period);
period_l1 = lag(period);
x_l2 = ifn(period_l2=period-2,lag2(x),ifn(period_l1=period-2,lag(x),.));
x_l1 = ifn(period_l1=period-1,lag(x),.);
x_avg = mean(x_l2,x_l1,x);
run;
if the length of series is not 3 anymore, use arrays and mean(of _array_[*])
to help yourself.
Upvotes: 0
Reputation: 12909
Use proc timeseries
to add missing values between each gap, then run it through proc expand
with method=none
. We'll consider the interval daily since it's incrementing by one value at a time. Filter out your final dataset to have no missing values of x
.
proc timeseries data = have
out = have_ts;
id period interval=day setmissing=missing;
var x;
run;
proc expand data = have_ts
out = want(where=(NOT missing(x)) );
id period;
convert x = x_avg / method=none transform=(moveave 3);
run;
You'll need to reformat period
to 8.
with proc datasets
since proc timeseries
needs to treat it as a date.
proc datasets lib=work nolist;
modify want;
format x 8.;
quit;
Upvotes: 1