SAS: Exclude patients based on diagnoses on multiple lines and calculate incidence rates

Question

I have large dataset of a few million patient encounters that include a diagnosis, timestamp, patientID, and demographic information.

For each patient, their diagnoses are listed on multiple lines. I need to exclude patients who have a certain diagnosis (282.1) and calculate incidence rates of other diseases in the year 2014.

IF diagnosis NE 282.1;

This in the data step does not work, because it does not take into account the other diagnoses on the other lines.

If possible, I would also like to calculate the incidence rates by disease.

This is an example of what the data looks like. There are multiple lines with multiple diagnoses.

PatientID   Diagnosis   Date    Gender  Age
1           282.1       1/2/10      F   25
1           232.1       1/2/10      F   87
1           250.02      1/2/10      F   41
1           125.1       1/2/10      F   46
1           90.1        1/2/10      F   58
2           140         12/15/13    M   57
2           132.3       12/15/13    M   41
2           149.1       12/15/13    M   66
3           601.1       11/19/13    F   58
3           231.1       11/19/13    F   76
3           123.1       11/19/13    F   29
4           282.1       12/30/14    F   81
4           130.1       12/30/14    F   86
5           230.1       1/22/14     M   60
5           282.1       1/22/14     M   46
5           250.02      1/22/14     M   53

Dirk Horsten · Accepted Answer

Dual reading sollution

Straight forward version

You said you sorted the data first, probably like this

proc sort data=MYLIB.DIAGNOSES;
    by PatientID;
run;

Assuming your data is ordered by patientID, you can process each with the diagnose to exclude first.

data WORK.NOT_HAVING_282_1;
    set MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1)) 
        MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
    by PatientID;

As we need to report by year, not by date:

year = year(Date);

Next step is to exclude those you don't need, so you need to remember if the unwanted diagnose occured:

    retain has_282_1;
    if first.PatientID then has_282_1 = 0;
    if diagnosis EQ 282.1 then has_282_1 = 1;

and then keep the other diagnoses in 2014 for patients that do not have 282.1

    else if not has_282_1 then output;
run;

Next you could SQL to count what you need

proc sql:
    create table MYLIB.STATISTICS as 
    select year, Diagonsis, count(distinct PatientID) as incidence
    from WORK.NOT_HAVING_282_1
    group by year, Diagonsis;
quit;

Improvements

The above solution would take more processing power then needed:

you read DIAGNOSES from diks, then write FIRST_282_1 to disk, just to read it back in again
you can keep multiple observations of the same diagnose at diffrerent dates in the same year for the same patient, so you need count(distinct PatientID), which is a costly operation.

About diagnose 282.1, we only need to know who was ever diagnosed:

proc sort noduplicates
    data=MYLIB.DIAGNOSES (where=(diagnosis EQ 282.1))
    out=WORK.HAVING_282_1 (keep=PatientID);
    by PatientID;
run;

About other diagnoses, we also need the year, which here:

data WORK.VIEW_OTHER / view=WORK.VIEW_OTHER;
    set MYLIB.DIAGNOSES (where=(diagnosis NE 282.1));
    year = year(Date);
    keep PatientID year Diagnose;
run;

but as we use a view, we do not realy read and calculate anything before the view is used in this sort:

proc sort noduplicates
    data=WORK.VIEW_OTHER (where=(diagnosis EQ 282.1))
    out=WORK.OTHER_DIAGNOSES;

    by PatientID year Diagnose;
run;

Now things become simpler. We use temproary variables exclude and other to indicate where data came from

data WORK.NOT_HAVING_282_1;
    set WORK.HAVING_282_1 (in=exclude)
        WORK.OTHER_DIAGNOSES (in=other);
    by PatientID;

    retain has_282_1;
    if first.PatientID then has_282_1 = exclude;

    if other and not has_282_1 then output;
run;

proc sql:
    create table MYLIB.STATISTICS as 
    select year, Diagonsis, count(*) as incidence
    from WORK.NOT_HAVING_282_1
    group by year, Diagonsis;
quit;

Remark: this code is not tested

SAS: Exclude patients based on diagnoses on multiple lines and calculate incidence rates

Answers (1)

Dual reading sollution

Straight forward version

Improvements

Related Questions