Recursively add to a data table in SAS

Question

I am new to SAS. I need to do x-iterations to populate my dataset called MYRS.

Each iteration needs to JOIN TABLE1 with (TABLE2+ MYRS) MINUS the records which are already in MYRS table.

Then, I need to update MYRS table with additional matches. The GOAL is to track a chain of emails.

MYRS is essentially a copy of TABLE1 and contains matching records. Kind of tricky. (simplified schema). Table1 Can have DUPS.

For example

TABLE1:
ID | EMAIL1 | EMAIL2 | EMAIL3 | EMAIL4|
1 | A | s | d | F
2 | g | F | j | L
3 | z | x | L | v
4 | z | x | L | v
2 | g | F | j | L

TABLE2:
EMAIL
A

MYRS (starts as empty dataset)
EMAIL1 | EMAIL2 | EMAIL3 | EMAIL4

Logic: TABLE1 has email that matches email in TABLE2. Therefore this record need to show up. Other records don't match anything in TABLE2. But because Record1 and Record2 share the same ALTERNATIVE email F, Record2 also need to be shown. But because Record2 and Record3 share same alternative email L, Record3 also needs to be shown. And so fourth...

proc sql;        
SELECT TABLE1.id,
    TABLE1.email1,
    TABLE1.email2,
    TABLE1.email3,
    TABLE1.email4
FROM TABLE1
INNER JOIN (
    SELECT EMAIL
    FROM TABLE2     
     UNION      
    SELECT EMAIL1 AS EMAIL
    FROM MYRS   
     UNION      
    SELECT EMAIL2 AS EMAIL
    FROM MYRS   
     UNION      
    SELECT EMAIL3 AS EMAIL
    FROM MYRS   
     UNION      
    SELECT EMAIL4 AS EMAIL
    FROM MYRS
    )
ON EMAIL=EMAIL1 OR EMAIL=EMAIL2 OR EMAIL=EMAIL3 OR EMAIL=EMAIL4
WHERE TABLE1.id NOT IN (
        SELECT DISTINCT ID
        FROM MYRS
        )
quit;

How can I create the following logic:

Wrap this into some sort of function
Before sql execution, count amount of records in MYDS and SAVE the count
Execute SQL and update MYDS
Count amount of records in MYDS
If MYDS count did not change, stop execution
Else, goto #3

I am very new to SAS (3 days to be exact) and trying to put everything together. (I would use the logic above if I was to do that in Java)

SRSwift · Accepted Answer

Here is a macro approach, it mostly follows your logic but transforms your data first and the input/output is a list of IDs (you can easily get to and from emails with this).

This code will probably introduce quite a few SAS features that you are unfamiliar with, but the comments and explanations below should help . If any of it is still unclear take a look at the links or add a comment.

It expects input data:

inData: Your TABLE1 with ID and EMAIL* variables
matched: An initial list of known wanted IDs

It returns:

matched: An updated list of wanted IDs

/* Wrap the processing in a macro so that we can use a %do loop */
%macro looper(maxIter = 5);
    /* Put all the emails in one column to make comparison simpler */
    proc transpose data = inData out = trans (rename = (col1 = email));
        by ID;
        var email:;
    run;        
    /* Initialise the counts for the %where condition */
    %let _nMatched = 0;
    %let nMatched = 1;
    %let i = 0;
    /* Loop until no new IDs are added (or maximum number of iterations) */
    %do %while(&_nMatched. < &nMatched. and &i < &maxIter.); 
        %let _nMatched = &nMatched.;
        %let i = %eval(&i. + 1);
        %put NOTE: Loop &i.: &nMatched. matched.;
        /* Move matches to a temporary table */
        proc datasets library = work nolist nowarn;
            delete _matched;
            change matched = _matched;
        quit;
        /* Get new matched IDs */
        proc sql noprint;
            create table matched as
            select distinct c.ID
            from _matched as a
            left join trans as b
                on a.ID = b.ID
            left join trans as c
                on b.email = c.email;
            /* Get new count */
            select count(*) into :nMatched from matched;
        quit;
    %end;
%mend looper;
%looper(maxIter = 10);

The interesting bits are:

proc transpose: Converts the input into a deep table so that all the email addresses are in one variable, this makes writing the email comparison logic simpler (less repetition needed) and puts the data in a format that will make it easier for you to clean the email addresses if necessary (think upcase(), strip(), etc.).
%macro %mend: The statements used to define a macro. This is necessary as you cannot use macro logic or loops in open code. I've also added an argument so you can see how that works.
%let and select into :: Two ways to create macro variables. Macro variables are referenced with the prefix & and are used to insert text into the SAS program before it is executed.
%do %while() %end: One of the ways to perform a loop within a macro. The code within will be run repeatedly until the condition evaluates to false.
proc datasets: A procedure for performing admin tasks on datasets and libraries. Used here to delete and rename temporary tables.

Recursively add to a data table in SAS

Answers (1)

Related Questions