Reputation: 33
I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.
For each patient, their diagnoses are listed on multiple lines. I need to exclude patients who have a certain diagnosis (282.1) and calculate incidence rates of other diseases in the year 2014.
IF diagnosis NE 282.1;
This in the data step does not work, because it does not take into account the other diagnoses on the other lines.
If possible, I would also like to calculate the incidence rates by disease.
This is an example of what the data looks like. There are multiple lines with multiple diagnoses.
PatientID Diagnosis Date Gender Age
1 282.1 1/2/10 F 25
1 232.1 1/2/10 F 87
1 250.02 1/2/10 F 41
1 125.1 1/2/10 F 46
1 90.1 1/2/10 F 58
2 140 12/15/13 M 57
2 132.3 12/15/13 M 41
2 149.1 12/15/13 M 66
3 601.1 11/19/13 F 58
3 231.1 11/19/13 F 76
3 123.1 11/19/13 F 29
4 282.1 12/30/14 F 81
4 130.1 12/30/14 F 86
5 230.1 1/22/14 M 60
5 282.1 1/22/14 M 46
5 250.02 1/22/14 M 53
Upvotes: 1
Views: 108
Reputation: 3845
You said you sorted the data first, probably like this
proc sort data=MYLIB.DIAGNOSES;
by PatientID;
run;
Assuming your data is ordered by patientID
, you can process each with the diagnose to exclude first.
data WORK.NOT_HAVING_282_1;
set MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1))
MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
by PatientID;
As we need to report by year, not by date:
year = year(Date);
Next step is to exclude those you don't need, so you need to remember if the unwanted diagnose occured:
retain has_282_1;
if first.PatientID then has_282_1 = 0;
if diagnosis EQ 282.1 then has_282_1 = 1;
and then keep the other diagnoses in 2014 for patients that do not have 282.1
else if not has_282_1 then output;
run;
Next you could SQL to count what you need
proc sql:
create table MYLIB.STATISTICS as
select year, Diagonsis, count(distinct PatientID) as incidence
from WORK.NOT_HAVING_282_1
group by year, Diagonsis;
quit;
The above solution would take more processing power then needed:
DIAGNOSES
from diks, then write FIRST_282_1 to disk, just to read it back in again count(distinct PatientID)
, which is a costly operation. About diagnose 282.1, we only need to know who was ever diagnosed:
proc sort noduplicates
data=MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1))
out=WORK.HAVING_282_1 (keep=PatientID);
by PatientID;
run;
About other diagnoses, we also need the year, which here:
data WORK.VIEW_OTHER / view=WORK.VIEW_OTHER;
set MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
year = year(Date);
keep PatientID year Diagnose;
run;
but as we use a view, we do not realy read and calculate anything before the view is used in this sort:
proc sort noduplicates
data=WORK.VIEW_OTHER (where=(diagnosis EQ 282.1))
out=WORK.OTHER_DIAGNOSES;
by PatientID year Diagnose;
run;
Now things become simpler. We use temproary variables exclude
and other
to indicate where data came from
data WORK.NOT_HAVING_282_1;
set WORK.HAVING_282_1 (in=exclude)
WORK.OTHER_DIAGNOSES (in=other);
by PatientID;
retain has_282_1;
if first.PatientID then has_282_1 = exclude;
if other and not has_282_1 then output;
run;
proc sql:
create table MYLIB.STATISTICS as
select year, Diagonsis, count(*) as incidence
from WORK.NOT_HAVING_282_1
group by year, Diagonsis;
quit;
Remark: this code is not tested
Upvotes: 1