Reputation: 592
me again. It's a bit tricky to explain this, but I'll try my best. I have a dataset that contains customer ID's, days in which those customers were called and a dummy column when 1 is when we talked to a person and 0 is when we didn't.
data Have;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 30/11/2020 0
001 01/12/2020 1
001 30/12/2020 1
001 31/12/2020 1
001 01/01/2021 1
002 01/02/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;
run;
data Want;
input ID callDate :ddmmyy. Contacted ;
format callDate ddmmyy10.;
datalines4;
001 01/12/2020 1
001 01/01/2021 1
002 02/02/2021 1
002 15/02/2021 1
;;;
run;
I basically want to keep only records that a successful contact AND are not followed on the next day or the day after next by a successful contact.
For example, for consecutive days like these:
day 1 - 0
day 2 - 1
day 3 - 0
day 4 - 1
day 5 - 0
day 6 - 1
I would end up just keeping day 6.
I don't even know where to start really... in python I would make nested loops that goes through rows, but in proc sql
I'm not sure how to approach this.
If there is any idea at least to give me a push, I'd appreciate it.
Upvotes: 1
Views: 59
Reputation: 27508
You can select from a table using a correlated sub-query against itself.
SQL will perform all the looping you would have otherwise coded in python.
Example:
proc sql;
create table want as
select * from have as each
where
not exists (
select * from have as self
where self.calldate between each.calldate + 1 and each.calldate + 2 /* follow up */
and self.contacted = 1 /* with an assertion of contact */
)
;
Upvotes: 1
Reputation: 12909
There are a few ways to approach this, one of which is a data step. One thing SAS does not do well in the data step is look ahead; however, there are some great tricks out there for it. Here is a one-pass solution you can use that only uses the data step and is based on Andrew Gannon's SGF 2019 paper, Calculating Leads and Lags in SAS: One Problem, Many Solutions. This solution can be compacted with arrays but is expanded to make it a bit easier to understand the overall logic.
The basic idea: use some SAS functions to open up the dataset you're reading and fetch future observations within it using direct access. We can pull future values for each observation by simply using _N_+lead
. Since we've opened up the dataset we're currently reading in a set
statement, our obs fetches will always be in-sync for each row.
data want;
set have;
by id callDate;
retain _dsid_;
/* Open up your current dataset */
if(_N_ = 1) then _dsid_ = open('have');
/* Fetch 1 obs ahead of time */
_lead1_rc_ = fetchobs(_dsid_, _N_+1);
/* Fetch values from 1 obs ahead using the dataset we opened */
lead1_date = getvarn(_dsid_, varnum(_dsid_, "callDate"));
lead1_contacted = getvarn(_dsid_, varnum(_dsid_, "Contacted"));
lead1_id = getvarn(_dsid_, varnum(_dsid_, "id"));
/* Fetch2 obs ahead */
_lead2_rc_ = fetchobs(_dsid_, _N_+2);
lead2_date = getvarn(_dsid_, varnum(_dsid_, "callDate"));
lead2_contacted = getvarn(_dsid_, varnum(_dsid_, "Contacted"));
lead2_id = getvarn(_dsid_, varnum(_dsid_, "id"));
/* Set future values to missing if we've crossed to the next ID */
if(lead1_id NE id) then call missing(lead1_date, lead1_contacted);
if(lead2_id NE id) then call missing(lead2_date, lead2_contacted);
/* Keep only records that a successful contact AND are not followed on the next day or the day after next by a successful contact */
if(contacted AND NOT ( (lead1_contacted AND (lead1_date - callDate LE 2) OR (lead2_contacted AND (lead2_date - callDate LE 2) ) ) ) );
format lead1_date lead2_date date9.;
run;
Another solution can use the lead
function within PROC EXPAND
with the where
logic added as an output dataset option:
proc expand data=have
out=want(drop=time
where=(contacted AND NOT ( (lead1_contacted AND (lead1_date - callDate LE 2) OR (lead2_contacted AND (lead2_date - callDate LE 2) ) ) ) )
);
by id;
convert callDate = lead1_Date / transform=(lead 1);
convert callDate = lead2_Date / transform=(lead 2);
convert contacted = lead1_contacted / transform=(lead 1);
convert contacted = lead2_contacted / transform=(lead 2);
run;
Upvotes: 0