Reputation: 1144
I have a SAS dataset with 3 columns. A FirmIndex
, a ProducIndex
and a third column called PrChange
. Within each FirmIndex
& ProductIndex
group I want to count how many PrChange
are different from .
and from 0
, and add that to a column called Number
. Then I want to divide that column Number
by the number of observations within each group which are not .
.
Below an example of the dataset and desired output.
data prod;
input firmindex productindex PrChange Number Fract;
cards;
1 1 . 1 0.250
1 1 0.00 1 0.250
1 1 0.00 1 0.250
1 1 -0.40 1 0.250
1 1 0.00 1 0.250
1 2 . 2 1.000
1 2 1.00 2 1.000
1 2 0.30 2 1.000
1 3 . 4 0.800
1 3 0.70 4 0.800
1 3 1.00 4 0.800
1 3 0.70 4 0.800
1 3 0.00 4 0.800
1 3 -0.30 4 0.800
1 4 . 5 1.000
1 4 0.20 5 1.000
1 4 -1.00 5 1.000
1 4 -0.90 5 1.000
1 4 -0.50 5 1.000
1 4 1.00 5 1.000
2 1 . 2 1.000
2 1 0.30 2 1.000
2 1 -0.50 2 1.000
2 2 . 5 0.714
2 2 0.30 5 0.714
2 2 0.10 5 0.714
2 2 0.00 5 0.714
2 2 0.00 5 0.714
2 2 0.80 5 0.714
2 2 -0.20 5 0.714
2 2 0.40 5 0.714
2 3 . 1 1.000
2 3 0.60 1 1.000
2 4 . 5 0.714
2 4 -1.00 5 0.714
2 4 0.80 5 0.714
2 4 -0.20 5 0.714
2 4 0.00 5 0.714
2 4 0.00 5 0.714
2 4 -0.70 5 0.714
2 4 0.90 5 0.714
2 5 . 3 1.000
2 5 0.90 3 1.000
2 5 -0.70 3 1.000
2 5 -0.50 3 1.000
;
run;
Here is what I tried to generate the column number
, but it is not working:
data work.prod;
set work.prod;
by firmindex productindex;
if first.productindex or first.firmindex then sum = 0;
else if PrChange ne 0 and PrChange ne .;
sum = sum + 1;
run;
Upvotes: 2
Views: 4062
Reputation: 63424
I'm going to give the IML answer that I'm able to give. Rick or someone else more IML-savvy probably can do better than this. In R or other matrix language I think this would be much easier, but I don't have the IML chops to do this without looping; maybe it's possible.
proc iml;
use have;
read all var _all_ into h;
u = h[uniqueby(h,1:2), 1:2]; *generate the "unique" categories for the first two columns;
v = j(nrow(h),5); *generate a matrix to save this into;
v[,1:3] = h; *start it out with the first three columns of the dataset;
do i = 1 to nrow(u); *iterate over the unique category matrix;
number = ncol(loc(h[loc((h[,1:2] = u[i,1:2])[,#]),3]));
*the inner LOC produces a two column 1/0 matrix with match 1 / nomatch 0 for each col
then reduce to 1 column via subscript reduction product, to get correct 1/0 match vector
the outer LOC takes the rows of h from that (so rows of h matching u), then returns nonzero/nonmissing
which then ncol summarizes into a count;
fract_denom = ncol(loc(h[loc((h[,1:2] = u[i,1:2])[,#]),3] ^= .));
*similar, but here we have to verify they are not missing explicitly, considering 0 valid;
v[loc((v[,1:2] = u[i,1:2])[,#]),4] = number; *assign to col4 of V;
v[loc((v[,1:2] = u[i,1:2])[,#]),5] = number/fract_denom; *assign to col5 of V;
end;
print v;
quit;
This uses the unique-loc method, more or less, with some modifications; probably is an easier way to get the matches.
Upvotes: 1
Reputation: 63424
A SQL in SAS solution - Parfait's is probably the better one overall, but SAS's willingness to remerge makes the SASsy solution a bit simpler.
proc sql;
create table want as
select firmindex, productindex, prchange,
sum (not (prchange in (0,.))) as number,
calculated number / (sum ( not missing(prchange))) as fract
from have
group by firmindex, productindex;
quit;
SAS will do the grouping/counting/etc. and then merge back to the original dataset with no problem, skipping the need for correlated subqueries. NOT standard SQL, but quite common in SAS nonetheless.
Upvotes: 0
Reputation: 63424
Your problem here is that you need the number to divide by prior to running the rows of data. This is where SAS is different from Excel; SAS is row-based, meaning it takes your code and runs it against each row of data (more or less) one at a time, rather than dynamically looking at every cell from every other cell (like Excel). Much faster and more efficient but less flexible for stuff like this.
Your particular question begs for a DoW loop. This takes over the normal data step loop and performs its own loop - twice. Once to calculate the number/fract values, then once to copy those to the BY group. Note I only check for last.productIndex
; last/first transitions are always set on a second by variable when they're true for the first by variable.
Here we do the first loop once for the first set of values( the first 5 records) then we re-loop through the same 5 records. Then for the next 3. Etc. Each time the two loops take the same number of rows so they always stay in sync.
data want;
do _n_ = 1 by 1 until (last.productIndex);
set have;
by firmindex productindex;
number_denom = sum(number_Denom,not missing(PrChange));
number = sum(number, not (PrChange in (.,0)));
end;
fract = number/number_denom;
do _n_ = 1 by 1 until (last.productIndex);
set have;
by firmindex productindex;
output;
end;
run;
Upvotes: 1
Reputation: 107652
Consider proc sql
using conditional CASE WHEN
correlated subqueries:
proc sql;
create table ProdChangeCount as
SELECT p.firmindex, p.productindex,
(SELECT SUM(CASE WHEN sub.PrChange ^= . AND sub.PrChange ^= 0 THEN 1 ELSE 0 END)
FROM Prod sub
WHERE sub.firmindex = p.firmindex
AND sub.productindex = p.productindex) AS Number,
CALCULATED Number /
(SELECT Count(*)
FROM Prod sub
WHERE sub.PrChange ^= .
AND sub.firmindex = p.firmindex
AND sub.productindex = p.productindex) AS Frac
FROM Prod p;
quit;
Upvotes: -1