Reputation: 141
In the sample data below, I am trying to remove duplicate words from each string using the following code:
data z;
input pvd_name_orig $50.;
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;
DATA want (keep=pvd_name_orig pvd_name_temp);
SET z;
pvd_name_temp=scan(pvd_name_orig, 1, ' ');
do i=2 to countw(pvd_name_orig,' ');
word=scan(pvd_name_orig, i, ' ');
found=find(pvd_name_temp, word, 'it');
if found=0 then pvd_name_temp=catx(' ', pvd_name_temp, word);
end;
run;
The code works for every name except those whose middle initial also occurs as a letter inside an earlier word (for example, the N in JOHN N MD SMITH MD is found inside JOHN). In that case the initial is treated as a repetition and dropped. Is there a way to avoid deleting single-letter words in my string?
I tried manually adding a period after the middle initial, and then the code keeps it in the new variable. However, I have not been able to add that period with SAS code. The code below was meant to put a period after the second word when that word is a single letter, but it only places a period at the second character of the whole string.
data want;
set z;
if length(compress(scan(pvd_name_orig,2,' '),'.'))=1 then substr(pvd_name_orig,2,1)='.';
run;
My final desired output is:
Obs  pvd_name_orig          pvd_name_temp
1    MD SMITH, JOHN MD      MD SMITH, JOHN
2    SMITH, JOHN W          SMITH, JOHN W
3    MD T. SMITH, JOHN W.   MD T. SMITH, JOHN W.
4    SMITH, JOHN WILLIAM    SMITH, JOHN WILLIAM
5    JOHN N MD SMITH MD     JOHN N MD SMITH
6    MD JOHN W. SMITH MD    MD JOHN W. SMITH
7    MD SMITH, MD JOHN      MD SMITH, JOHN
Any suggestions?
Upvotes: 0
Views: 521
Reputation: 4554
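A regular expression can solve this. In the pattern, \b is a word boundary and \S+ is a run of non-space characters, so (\b\S+\b) captures a word; (.*) captures whatever follows it; and \b\1\b matches a later whole-word repeat of the captured word, which is the part to delete. The \b on each side of \1 keeps a single letter such as T from matching inside SMITH. In the replacement, \1\2 keeps the word and the text after it.
compbl then collapses the double blank that is left when a duplicate is removed from the middle of the string.
data z;
input name $50.;
new_name=compbl(prxchange('s/(\b\S+\b)(.*)\b\1\b/\1\2/',-1,strip(name)));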
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;
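If a regular expression is not wanted, the loop from the question could probably be kept and only the comparison changed. The sketch below is one possible variant, not a tested solution: it assumes that words are separated by blanks only and that punctuation stays attached to its word (so SMITH, and SMITH would count as different words). It swaps FIND, which matches substrings, for FINDW, which matches whole delimited words; the LENGTH statement and the $50 width are assumptions carried over from the sample data.

```sas
data want (keep=pvd_name_orig pvd_name_temp);
  set z;
  /* assumed width, matching the $50. informat used for the sample data */
  length pvd_name_temp word $50;
  pvd_name_temp = scan(pvd_name_orig, 1, ' ');
  do i = 2 to countw(pvd_name_orig, ' ');
    word = scan(pvd_name_orig, i, ' ');
    /* FINDW compares whole blank-delimited words (case-insensitive with 'i'), */
    /* so a single letter such as N is not found inside JOHN                   */
    if findw(pvd_name_temp, strip(word), ' ', 'i') = 0 then
      pvd_name_temp = catx(' ', pvd_name_temp, word);
  end;
run;
```

With this change a middle initial should be dropped only when the same single-letter word has already been kept, while a repeated MD is still removed.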
Upvotes: 1