JPOLIVA
JPOLIVA

Reputation: 65

SAS - Loop through rows and calculate MD5

I want to sweep each table in a libname and calculate a hash over each row. For that purpose, i have already a table with libname, memname, concatenated columns with ',' and number of observations

libname memname columns num_obs
lib_A table_A col1a,col2a...colna 1
lib_A table_B col1b,col2b...colnb 2
lib_B table_C col1c,col2c...colnc 1

I first get all data into ranged macro variables (i think its easier to work, but could be wrong, ofc)

proc sql;
    select libname, memname, columns, num_obs
    into :lib1-, :tab1-, :column1-, :sqlobs1-
    from have
    where libname="&sel_livraria"; /*macro var = prompt from user*/
quit;

Just for developing guideline i made the code just to check one specific table without getting the row number of it since with a simple counter doesn't work (i get the order of the rows mess up each time i run) and it works for that purpose

%let lib=lib_A;
%let tab=table_B;
%let columns=col1b,col2b,colnb;
data want;
    length check $32.;
    format check $hex32.;
    set &lib..&tab;
    libname="&lib";
    memname="&tab";
    check = md5(cats(&columns));
    hash = put(check,$hex32.);
    keep libname memname hash;
    put hash;
    put _all_;
run;

So, what’s the best approach for getting a MD5 from each row (same order as tables) of all tables in a libname? I saw problems i couldn’t overcame using data steps, procs or macros.

The result i wanted if lib_A was selected in prompt were something like:

libname memname obs_row hash
lib_A table_A 1 64A29CCA15F53C83A9583841294A26AA
lib_A table_B 1 80DAC7B9854CF71A67F9C00A7EC4D9EF
lib_A table_B 2 0AC44CD79DAB2E33C93BB2312D3A9A40

Need some help.

Tks in advance.

Upvotes: 1

Views: 223

Answers (3)

Allan Bowe
Allan Bowe

Reputation: 12691

Stu's approach is nice, and will work most of the time but will fall over when you have wiiiide variables, a large number of variables, variables with large precision, and other edge cases.

So for the actual hashing part, you might consider this macro, which is extensively tested within Data Controller for SAS:

https://core.sasjs.io/mp__md5_8sas.html

Usage:

data _null_;
  set sashelp.class;
  hashvar=%mp_md5(cvars=name sex, nvars=age height weight);
  put hashvar=;
run;

Upvotes: 0

Tom
Tom

Reputation: 51566

I would make a macro that takes as input the four values in your metadata dataset. Note that commas are anathema to SAS programs, especially macro code, so make the macro so it can accept space delimited variable lists (like normal SAS program statements do).

To reduce the risk of name conflict I will name the variable using triple underscores and then rename them back to human friendly names when the dataset is written.

%macro next_ds(libname,memname,num_obs,varlist);
  data next_ds;
    length ___1 $8 ___2 $32 ___3 8 ___4 $32 ;
    ___1 = "&libname";
    ___2 = "&memname";
    ___3 + 1;
    set &libname..&memname(obs=&num_obs keep=&varlist);
    ___4 = put(md5(cats(of &varlist)),$hex32.);
    keep ___1-___4 ;
    rename ___1=libname ___2=memname ___3=obs_row ___4=hash;
  run;
%mend next_ds; 

Let's make some test metadata that reference datasets everyone should have.

data have;
  infile cards truncover ;
  input libname :$8. memname :$32. num_obs columns $200.;
cards;
sashelp class 3 name,sex,age
sashelp cars  2 make,model
;

And make sure the target dataset does not already exists.

%if %sysfunc(exist(want)) %then %do;
  proc delete data=want; run;
%end;

Now you can call that macro once for each observation in your source metadata dataset. There is no need to generated oodles of macro variables. Instead you can use CALL EXECUTE() to generate the macro calls directly from the dataset. We can replace the commas in the column lists when making the macro call. You can add in a PROC APPEND step after each macro call to aggregate the results into a single dataset.

data _null_;
  set have;
  call execute(cats(
     '%nrstr(%next_ds)(',libname,',',memname,',',num_obs
    ,',',translate(columns,' ',','),')'
  ));
  call execute('proc append data=next_ds base=want force; run;');
run;

Notice that wrapping the macro call in %NRSTR() makes the SAS log easier to read.

1    + %next_ds(sashelp,class,3,name sex age)
2    + proc append data=next_ds base=want force; run;
3    + %next_ds(sashelp,cars,2,make model)
4    + proc append data=next_ds base=want force; run;

Results:

Obs    libname    memname    obs_row                  hash

 1     sashelp     class        1       5425E9CEDA1DDEB71B2692A3C7050A8A
 2     sashelp     class        2       C532D227D358A3764C2D225DC8C02D18
 3     sashelp     class        3       13AD5F1517E0C4494780773B6DC15211
 4     sashelp     cars         1       777C60693BF5E16F38706C89301CD0A8
 5     sashelp     cars         2       07080C9321145395D1A2BCC10FBE6B83

Note that CATS() might not be the best method for generating the string to pass to the MD5() function. That can generate the same string for different combinations of the source variables. For example 'AB' || 'CD' is the same as 'A' || 'BCD'. Perhaps just use CAT() instead.

Upvotes: 0

Stu Sztukowski
Stu Sztukowski

Reputation: 12849

You're pretty close. This is how I would approach it. We'll create a macro with three parameters: data, lib, and out. data is the dataset you have with the column information. lib is the library you want to pull from your dataset, and out is the output dataset that you want to have.

We'll read each column into an individual macro variable:

  • memname1
  • memname2
  • memname3
  • libname1
  • libname2
  • libname3
  • etc.

From here, we simply need to loop over all of the macro variables and apply them where appropriate. We can easily count how many there are in a data step. All we need to do is add double-ampersands to resolve them correctly. For more information on why this is, check out this MWSUG paper.

%macro get_md5(data=, lib=, out=);

    /* Save all variables into macro variables:
       memname1 memname2 ...
       columns1 columns2 ...
    */
    data _null_;
        set &data.;
        where upcase(libname)=upcase("&lib.");

        call symputx(cats('memname', _N_), memname);
        call symputx(cats('columns', _N_), columns);
        call symputx(cats('obs', _N_), obs);
        call symputx('n_datasets', _N_);

    run;

    /* Loop through all the datasets and access each macro variable */
    %do i = 1 %to &n_datasets.;

        /* Double ampersand needed:
           First, resolve &i. to get &memname1
           Then resolve &mename1 to get the value stored in the macro variable memname1
        */
        %let memname = &&memname&i.;
        %let columns = &&columns&i.;
        %let obs     = &&obs&i.;

        /* Calculate md5 in a temporary dataset */
        data _tmp_;
            length lib     $8.
                   memname $32.
                   obs_row 8.
                   hash    $32.
            ;

            set &lib..&memname.(obs=&obs.);
            
            lib     = "&lib.";
            memname = "&memname.";
            obs_row = _N_;
            hash    = put(md5(cats(&columns.)), $hex32.);

            keep libname memname obs_row hash;
        run;

        /* Overwrite the dataset so we don't keep appending */
        %if(&i. = 1) %then %do;
            data &out.;
                set _tmp_;
            run;
        %end;
            %else %do;
                proc append base=&out. data=_tmp_;
                run;
            %end;
    %end;

    /* Remove temporary data */
    proc datasets lib=work nolist;
        delete _tmp_;
    quit;
%mend;

Example:

data have;
    length libname memname columns $15.;
    input libname$ memname$ columns$ obs;
    datalines;
sashelp cars make,model,msrp 1
sashelp class age,height,name 2
sashelp comet dose,length,sample 1
;
run;

%get_md5(data=have, lib=sashelp, out=want);

Output:

libname memname obs_row hash
sashelp cars    1       258DADA4843E7068ABAF95667E881B7F
sashelp class   1       29E8F4F03AD2275C0F191FE3DAA03778
sashelp class   2       DB664382B88BE7E445418B1A1C8CE13B
sashelp comet   1       210394B77E7696506FDEFD78890A8AB9

Upvotes: 0

Related Questions