D. O.
D. O.

Reputation: 626

SAS - Built a new table by calling a macro with a variable of another table passed as parameter

I have written this code to do this :

  1. read records in the table "not_identified" one by one
  2. for one record pass the "name_firstname" variable to a macro named "mCalcul_lev_D33",
  3. then, the macro calculates the Levenstein between the variable passed as parameter and all the values of the variable "name_firstname_in_D33" in "data_all" table,
  4. if the Levenstein returns a value less or equal to "3", then the record of "data_all" is copied to "lev_D33" table.
rsubmit;
%macro mCalcul_lev_D33(theName);
  data result.lev_D33;
   set result.data_all;
   name_LEV=complev(&theName, name_firstname_in_D33);
   if name_LEV<=3 then output;
  run;
%mend mCalcul_lev_D33;
endrsubmit;

rsubmit;
data _null_;
  set result.not_identified;
  call execute ('%mCalcul_lev_D33('||name_firstname||')');
;
run;
endrsubmit;

There is 53700000 records in "data_all". The code is running since yesterday. Because I cannot see the result, I am asking :

  1. Is the code doing what I want?
  2. How coding if I want to write "name_firstname" (the variable passed like parameter) in the beginning of each record of "lev_D33"?

Thank you!

Upvotes: 1

Views: 85

Answers (2)

D. O.
D. O.

Reputation: 626

Thanks to @Richard. I have used your second example to write this code :

rsubmit;
  data result.lev_D33;
    set result.not_identified (firstobs=1 obs=10);
    do check_row = 1 to 1000000;
      set &lib..data_all (firstobs=1 obs=1000000) point=check_row;
      name_lev = complev (name_firstname, name_firstname_D3);
      if name_lev <= 3 then output;
    end;
  run;
endrsubmit ;

And it worked like I wanted.

In this example, I compare name_firstname in not_identified table to all name_firstname_D3 in data_all. If the COMPLEV is less or equal to 3, then the merge of the 2 records are in the result table "lev_D33" (one record from not_identified is merged to one record from data_all).

To do a test, I taked 10 records from not_identified and tried to find a concordance of the names and the firstnames in 1000000 data_all only.

Upvotes: 0

Richard
Richard

Reputation: 27508

D.O.:

I posit your macros are making the task more difficult than need be. There appears to be an coding problem in that each row in not_identified record will cause the result.lev_D33 to be rebuilt. If your long running program ever does finish, the lev_D33 output data set will correspond to only the last not_identified.

You are doing full outer join comparing ALL_COUNT * NOT_IDENT_COUNT rows in the process.

  • How many rows are in not_identified ?
    Hopefully far less than data_all.
  • Is the result libname pointing to a network drive or remote server ?
    Networking i/o can make things run a very long time and even win you a phone call from the network team.

A full outer join in DATA Step can be done with nested loops and a point= on the inner loop SET. In DATA Step the outer loop is the implicit loop.

Consider this sample code:

data all_data;
  do row = 1 to 100;
    length name_firstname $20;  
    name_firstname 
      =  repeat (byte(65 + mod(row,26)), 4*ranuni(123))
      || repeat(byte(65 + 26*ranuni(123)), 4*ranuni(123))
    ;
    output;
  end;
run;

data not_identified;
  do row = 1 to 10;
    length name_firstname $20;
    name_firstname = repeat (byte(65 + mod(row,26)), 10*ranuni(123));
    output;
  end;
run;


data lev33;
  set all_data;
  do check_row = 1 to check_count;
    set not_identified (keep=name_firstname rename=name_firstname=check_name)
        nobs=check_count 
        point=check_row
    ;
    name_lev = complev (check_name, name_firstname);
    if name_lev <= 3 then output;
  end;
run;

This approach tests each not_identified before moving to the next row. This is a useful method when the all_data is very large and you might want to process chunks of it at a time. Chunk processing is an appropriate place to start macro coding:

%macro do_chunk (FROM_OBS=, TO_OBS=);
  data lev33_&FROM_OBS._&TO_OBS;
    set all_data (firstobs=&FROM_OBS obs=&TO_OBS);
    do check_row = 1 to check_count;
      set not_identified (keep=name_firstname rename=name_firstname=check_name)
          nobs=check_count 
          point=check_row
      ;
      name_lev = complev (check_name, name_firstname);
      if name_lev <= 3 then output;
    end;
  run;
%mend;

%macro do_chunks;
  %local index;
  %do index = 1 %to 100 %by 10;
    %do_chunk ( FROM_OBS=&index, TO_OBS=%eval(&index+9) )
  %end;
%mend;

%do_chunks

You might shepherd the whole the process, bypassing do_chunks and manually invoking do_chunk for various ranges of your choosing.

Upvotes: 1

Related Questions