Reputation: 1
Hi I have the following dataset. I am trying to find id who have at least 2 dx(411 or 430) within one year (dx1-dx5) and keep them in the dataset, otherwise drop. For example output in this case would be keeping id 1 and 3
data stat480.temp2;
input id $ date $9. dx1 $ dx2 $ dx3 $ dx4 $ dx5 $;
datalines;
1 1/1/2005 411 412 413 414 415
1 2/2/2005 411 412 413 414 415
2 1/1/2006 411 412 413 418 419
2 3/3/2006 420 421 422 423 424
3 3/4/2007 430 440 450 460 470
3 12/4/2006 430 510 520 530 560
;
run;
Upvotes: 0
Views: 105
Reputation: 10411
Still a little work remains to be done but this should work as a good starting point...
* First preping of data - eliminate rows that do not have dx411 nor dx430 ;
* and calculate time difference between rows of the same id) ;
data have;
informat id 1. date mmddyy10. dx1-dx5 3.;
format date yymmdd10. date_prev yymmdd10. delta_months 6.2;
retain date_prev (.) id_prev (.);
input id date dx1-dx5;
array dx(5);
if id = id_prev then
delta_months = intck("months", date_prev, date);
dx_411 = 411 in dx;
dx_430 = 430 in dx;
if dx_411=1 or dx_430=1 then output;
date_prev = date;
id_prev = id;
datalines;
1 1/1/2005 411 412 413 414 415
1 2/2/2005 411 412 413 414 415
2 1/1/2006 411 412 413 418 419
2 3/3/2006 420 421 422 423 424
3 12/4/2006 430 510 520 530 560
3 3/4/2007 430 440 450 460 470
;
Note that I have pre-sorted data so that older lines come first within an id group.
Identify and keep only id's showing on more than 1 row
proc sql;
create table want as
select id, date, dx1, dx2, dx3, dx4, dx5,
delta_months, dx_411, dx_430
from have
where id in (select id
from have
group by id
having count(id) > 1);
quit;
proc print;run;
What remains to be done:
filter out rows/ids for which delta_months is above your threshold.
filter out rows/ids that have, instead of 411-411 or 430-430, combinations 411-430 or 430-411. This could be done wither via proc sql or with a data step (incorporate it in the first one or in a subsequent independent one) using a retain statement. In this case-example, there were none of those.
Upvotes: 1
Reputation: 1120
I added a few extra rows to demonstrate that this works for more than two observations per year. Let me know if you have any questions.
data temp;
input id date $10. dx1 dx2 dx3 dx4 dx5;
datalines;
1 01/01/2005 411 412 413 414 415
1 02/02/2005 411 412 413 414 415
2 01/01/2006 411 412 413 418 419
2 03/03/2006 420 421 422 423 424
3 12/04/2006 430 510 520 530 560
3 03/04/2007 430 440 450 460 470
3 12/04/2007 430 510 520 530 560
3 12/05/2007 430 510 520 530 560
3 03/04/2008 411 440 450 460 470
;
run;
First, convert each row's date variable into a numeric format, and concatenate its dx variables (to more easily compare dx1 with dx4, for example):
data temp2;
set temp;
date_num = day(input(trim(date), mmddyy10.)) +
month(input(trim(date), mmddyy10.)) * 12 + year(input(trim(date), mmddyy10.)) * 365;
all_dx = catx(', ', dx1, dx2, dx3, dx4, dx5);
run;
Second, join this dataset to itself such that every observation is compared to every other observation within a year of it:
proc sql;
create table temp3 as select distinct
a.*, b.date_num as date_num2, b.all_dx as all_dx2
from temp2 as a
left join temp2 as b
on a.id = b.id and (b.date_num - 365) < a.date_num <= (b.date_num - 1);
quit;
Subset based on the 411 and 430 criteria:
data temp4;
set temp3 (where = (
(index(all_dx, "411") or index(all_dx, "430")) and
(index(all_dx2, "411") or index(all_dx2, "430"))));
run;
Select distinct IDs remaining:
proc sql;
create table temp5 as select distinct id from temp4;
quit;
Yielding the dataset:
1
3
Upvotes: 1