Reputation: 626
I have written this code to do this :
rsubmit; %macro mCalcul_lev_D33(theName); data result.lev_D33; set result.data_all; name_LEV=complev(&theName, name_firstname_in_D33); if name_LEV<=3 then output; run; %mend mCalcul_lev_D33; endrsubmit; rsubmit; data _null_; set result.not_identified; call execute ('%mCalcul_lev_D33('||name_firstname||')'); ; run; endrsubmit;
There is 53700000 records in "data_all". The code is running since yesterday. Because I cannot see the result, I am asking :
Thank you!
Upvotes: 1
Views: 85
Reputation: 626
Thanks to @Richard. I have used your second example to write this code :
rsubmit;
data result.lev_D33;
set result.not_identified (firstobs=1 obs=10);
do check_row = 1 to 1000000;
set &lib..data_all (firstobs=1 obs=1000000) point=check_row;
name_lev = complev (name_firstname, name_firstname_D3);
if name_lev <= 3 then output;
end;
run;
endrsubmit ;
And it worked like I wanted.
In this example, I compare name_firstname in not_identified table to all name_firstname_D3 in data_all. If the COMPLEV is less or equal to 3, then the merge of the 2 records are in the result table "lev_D33" (one record from not_identified is merged to one record from data_all).
To do a test, I taked 10 records from not_identified and tried to find a concordance of the names and the firstnames in 1000000 data_all only.
Upvotes: 0
Reputation: 27508
D.O.:
I posit your macros are making the task more difficult than need be. There appears to be an coding problem in that each row in not_identified record will cause the result.lev_D33
to be rebuilt. If your long running program ever does finish, the lev_D33 output data set will correspond to only the last not_identified.
You are doing full outer join comparing ALL_COUNT * NOT_IDENT_COUNT rows in the process.
not_identified
?result
libname pointing to a network drive or remote server ?A full outer join in DATA Step can be done with nested loops and a point= on the inner loop SET. In DATA Step the outer loop is the implicit loop.
Consider this sample code:
data all_data;
do row = 1 to 100;
length name_firstname $20;
name_firstname
= repeat (byte(65 + mod(row,26)), 4*ranuni(123))
|| repeat(byte(65 + 26*ranuni(123)), 4*ranuni(123))
;
output;
end;
run;
data not_identified;
do row = 1 to 10;
length name_firstname $20;
name_firstname = repeat (byte(65 + mod(row,26)), 10*ranuni(123));
output;
end;
run;
data lev33;
set all_data;
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
This approach tests each not_identified before moving to the next row. This is a useful method when the all_data is very large and you might want to process chunks of it at a time. Chunk processing is an appropriate place to start macro coding:
%macro do_chunk (FROM_OBS=, TO_OBS=);
data lev33_&FROM_OBS._&TO_OBS;
set all_data (firstobs=&FROM_OBS obs=&TO_OBS);
do check_row = 1 to check_count;
set not_identified (keep=name_firstname rename=name_firstname=check_name)
nobs=check_count
point=check_row
;
name_lev = complev (check_name, name_firstname);
if name_lev <= 3 then output;
end;
run;
%mend;
%macro do_chunks;
%local index;
%do index = 1 %to 100 %by 10;
%do_chunk ( FROM_OBS=&index, TO_OBS=%eval(&index+9) )
%end;
%mend;
%do_chunks
You might shepherd the whole the process, bypassing do_chunks
and manually invoking do_chunk
for various ranges of your choosing.
Upvotes: 1