Reputation: 7788
I am new to SAS. I need to do x-iterations to populate my dataset called MYRS
.
Each iteration needs to JOIN TABLE1
with (TABLE2
+ MYRS
) MINUS the records which are already in MYRS
table.
Then, I need to update MYRS
table with additional matches. The GOAL is to track a chain of emails.
MYRS
is essentially a copy of TABLE1
and contains matching records. Kind of tricky. (simplified schema). Table1
Can have DUPS.
For example
TABLE1:
ID | EMAIL1 | EMAIL2 | EMAIL3 | EMAIL4|
1 | A | s | d | F
2 | g | F | j | L
3 | z | x | L | v
4 | z | x | L | v
2 | g | F | j | L
TABLE2:
EMAIL
A
MYRS (starts as empty dataset)
EMAIL1 | EMAIL2 | EMAIL3 | EMAIL4
Logic: TABLE1
has email that matches email in TABLE2
. Therefore this record need to show up. Other records don't match anything in TABLE2
. But because Record1
and Record2
share the same ALTERNATIVE email
F, Record2
also need to be shown. But because Record2
and Record3
share same alternative email L, Record3
also needs to be shown. And so fourth...
proc sql;
SELECT TABLE1.id,
TABLE1.email1,
TABLE1.email2,
TABLE1.email3,
TABLE1.email4
FROM TABLE1
INNER JOIN (
SELECT EMAIL
FROM TABLE2
UNION
SELECT EMAIL1 AS EMAIL
FROM MYRS
UNION
SELECT EMAIL2 AS EMAIL
FROM MYRS
UNION
SELECT EMAIL3 AS EMAIL
FROM MYRS
UNION
SELECT EMAIL4 AS EMAIL
FROM MYRS
)
ON EMAIL=EMAIL1 OR EMAIL=EMAIL2 OR EMAIL=EMAIL3 OR EMAIL=EMAIL4
WHERE TABLE1.id NOT IN (
SELECT DISTINCT ID
FROM MYRS
)
quit;
How can I create the following logic:
I am very new to SAS (3 days to be exact) and trying to put everything together. (I would use the logic above if I was to do that in Java)
Upvotes: 2
Views: 920
Reputation: 1710
Here is a macro approach, it mostly follows your logic but transforms your data first and the input/output is a list of ID
s (you can easily get to and from emails with this).
This code will probably introduce quite a few SAS features that you are unfamiliar with, but the comments and explanations below should help . If any of it is still unclear take a look at the links or add a comment.
It expects input data:
inData
: Your TABLE1 with ID
and EMAIL*
variablesmatched
: An initial list of known wanted ID
sIt returns:
matched
: An updated list of wanted ID
s/* Wrap the processing in a macro so that we can use a %do loop */
%macro looper(maxIter = 5);
/* Put all the emails in one column to make comparison simpler */
proc transpose data = inData out = trans (rename = (col1 = email));
by ID;
var email:;
run;
/* Initialise the counts for the %where condition */
%let _nMatched = 0;
%let nMatched = 1;
%let i = 0;
/* Loop until no new IDs are added (or maximum number of iterations) */
%do %while(&_nMatched. < &nMatched. and &i < &maxIter.);
%let _nMatched = &nMatched.;
%let i = %eval(&i. + 1);
%put NOTE: Loop &i.: &nMatched. matched.;
/* Move matches to a temporary table */
proc datasets library = work nolist nowarn;
delete _matched;
change matched = _matched;
quit;
/* Get new matched IDs */
proc sql noprint;
create table matched as
select distinct c.ID
from _matched as a
left join trans as b
on a.ID = b.ID
left join trans as c
on b.email = c.email;
/* Get new count */
select count(*) into :nMatched from matched;
quit;
%end;
%mend looper;
%looper(maxIter = 10);
The interesting bits are:
proc transpose
: Converts the input into a deep table so that all the email addresses are in one variable, this makes writing the email comparison logic simpler (less repetition needed) and puts the data in a format that will make it easier for you to clean the email addresses if necessary (think upcase()
, strip()
, etc.).%macro %mend
: The statements used to define a macro. This is necessary as you cannot use macro logic or loops in open code. I've also added an argument so you can see how that works.%let
and select into :
: Two ways to create macro variables. Macro variables are referenced with the prefix &
and are used to insert text into the SAS program before it is executed.%do %while() %end
: One of the ways to perform a loop within a macro. The code within will be run repeatedly until the condition evaluates to false.proc datasets
: A procedure for performing admin tasks on datasets and libraries. Used here to delete and rename temporary tables.Upvotes: 1