I have a situation where for each unique observation of casenum I would like to run varies queries and arithmatic operations between various observations of 'code' for that 'casenum' (see below). For example for casenum 1234567 I would like subtract data for code 0200 - code 0234 or 531 - 53. Please keep in mind that there are thousands of observations in this dataset. Is there an easy way to do this or to do row comparisons with the particular.
Please note casenum and code are character variables and data is a numeric variable
Here is an example of how the dataset is structured:
casenum code data 1234567 0123 4597 1234567 0234 53 1234567 0100 789 1234567 0200 531 1234567 0300 354 1111112 0123 79 1111112 0234 78 1111112 0100 77 1111112 0200 7954 1111112 0300 35
Here is the logic although likely syntactically incorrect of what I am trying to do.
For code observations where casenum is the same, within those casenums
I would like it to determine, if data for code 0234 + data for code 0100 - data for code 0123 ne data for code 0200 then newvariable = 'YES'
In other words I'd like it to test if 53 + 789 - 4597 ne 531. after that and other similar kinds of tests run within casenum 1234567, I'd like it to move onto the next casenum, and run those same tests for that casenum.
Keep in mind this dataset has hundreds of thousands of observations in it.
Upvotes: 0
Views: 5738
Reputation: 5913
If the formula is fixed (as your example seems to suggest), then there shouldn't be any reason that you can't do a straightforward transpose and then declare the test explicitly.
/* Transpose the data by casenum */
proc transpose data=so846572 out=transpose_ds;
id code;
var data;
by casenum;
run;
/* Now just explicitly write your conditional expression */
data StackOverflow;
set transpose_ds;
if _0234 + _0100 - _0123 <> _0200 then newvariable="yes";
run;
Where so846572 = Your original dataset, transpose_ds = Transposed version, StackOverflow = final output.
Let us know if this expression needs to be dynamic for some reason. This should easily scale to the volume of data you've mentioned without any problems. You could conceivably do the same kind of thing with a hash as well in one pass of the data.
Upvotes: 1
Reputation: 28441
I don't think I really have enough info from your question to help, but I will just throw this out.... If you want to do row comparison, you can also use the data step. Assuming you have your data sorted by casenum you can use first. and last. to determine when you have a new casenum and when you are on the last row of a casenum. If you want to sum up data values between rows or make decisions based on a previous row for a casenum listed multiple times.
Data work.temp ;
retain casenum_data ;
set lib.data ;
by casenum ;
if first.casenum then do ;
/* <reset hold vars> */
casenum_data = 0 ;
end ;
if code = "0200" or code = "234" then .....
if last.casenum then do ;
/* output casenum summary */
output ;
end ;
run ;
Post more info about need and more help can be given.
Upvotes: 0
Reputation: 1050
I'm unclear on what your logic is for the subtraction part of the code, but for the selection of a group of rows I can suggest. At first glance I would obtain a list of distinct values for casenum.
proc sql;
select distinct casenum
into :casenum_list separated by ' '
from dataset;
quit;
Now that you have a list of all distinct casenum values, I would iterate through the rows following whatever logic you need.
Possibly using another proc sql like:
%MACRO DOIT;
%LET COUNT=1;
%DO %UNTIL (%SCAN(&casenum_list,&COUNT) EQ);
%LET CASENUM_VAR=%SCAN(&casenum_list,&COUNT);
PROC SQL;
SELECT
<INSERT SOME SQL LOGIC HERE>
FROM
DATASET
WHERE CASENUM=&CASENUM_VAR;
QUIT;
%LET COUNT=%EVAL(&COUNT+1);
%MEND DOIT;
%DOIT;
I hope this helps. If you can provide more insight into what you are trying to accomplish within the rows, I can be more specific.
Upvotes: 1