Reputation: 1
I am trying to find cumulative sum for my dataset based on name. I have following 2 methods of doing so with very little difference.
The dataset looks like below
1 David 45
1 David 74
2 Sam 45
2 Ram 54
3 Bane 87
3 Mary 92
3 Bane 87
4 Dane 23
5 Jenny 87
5 Ken 87
6 Simran 63
8 Priya 72
PROC SORT DATA = READIN;
BY ID;
RUN;
DATA READIN1;
SET READIN;
BY ID;
if first.id then
cum_score=score;
else cum_score+score;
run;
DATA READIN2;
SET READIN;
if first.id then
cum_score=score;
else score+cum_score;
run;
While the first method gives me correct cumulative frequency, also suggesting that in BY group processing, newly created variables are auto retained, the second method gives incorrect results suggesting new variables aren't auto retained I by group processing. Interestingly, the only change in the two methods are the cum_score+score is changed to score+cum_score(which shall give same result according to laws of addition)
Please explain what is happening here.
Upvotes: 0
Views: 80
Reputation: 51601
This has nothing to do with the BY processing. It is about the meaning of a sum statement and the impact of trying to retain variables that are being loaded by a SET statement.
A sum statement is of the form
variable + expression ;
It is essentially equivalent to these two statements:
retain variable 0 ;
variable = sum(variable,expression);
When you retain a variable then SAS will not reset it to missing at the start of the next data set iteration. So the value it had at the end of iteration N is "retained" to the start of iteration N+1. But if that variable is included from a dataset then the retained value will be lost when it is overwritten by the value read from the dataset.
Also note that all variables that are defined in input datasets are automatically retained, but since normally they are overwritten when the SET (or MERGE, UPDATE, etc) statement runs you don't notice it.
Upvotes: 1
Reputation: 21274
cum_score + score
and score + cum_score
are mathematically the same, but not the same in SAS.
A SUM statement has an implicit retain, but it matters what order the variables are in the statement. This is because variables that already exist, are not retained, at least not how you'd expect it to be retained, and if cum_score is first in the statement it will be retained.
Upvotes: 0