PriyamK

Reputation: 141

Removing duplicate substring in SAS

In the following sample data I am trying to remove the duplicate substrings in my string using the code below:

data z;
input pvd_name_orig $50.;
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;

DATA want (keep=pvd_name_orig pvd_name_temp);
SET z;
   pvd_name_temp=scan(pvd_name_orig, 1, ' ');
   do i=2 to countw(pvd_name_orig,' ');
      word=scan(pvd_name_orig, i, ' ');
      found=find(pvd_name_temp, word, 'it');
      if found=0 then pvd_name_temp=catx(' ', pvd_name_temp, word);
   end;
run;

However, the code above works for every name except those with a single-letter middle initial that also occurs elsewhere in the string. In that case the initial is deleted because it is treated as a repetition. Is there a way to avoid deleting single-letter words in my string?

I tried manually adding a period after the middle initial, and then the code no longer deletes it in the new variable. However, I have not been able to add that period with SAS code. The following code was meant to add a period after the second word when that word is a single letter, but it only puts a period at the second character of the string.

data want;
   set z;

if length(compress(scan(pvd_name_orig,2,' '),'.'))=1 then substr(pvd_name_orig,2,1)='.';
   run;

My final desired output is

       Obs    pvd_name_orig           pvd_name_temp

        1     MD SMITH, JOHN MD       MD SMITH, JOHN
        2     SMITH, JOHN W           SMITH, JOHN W
        3     MD T. SMITH, JOHN W.    MD T. SMITH, JOHN W.
        4     SMITH, JOHN WILLIAM     SMITH, JOHN WILLIAM
        5     JOHN N MD SMITH MD      JOHN N MD SMITH
        6     MD JOHN W. SMITH MD     MD JOHN W. SMITH
        7     MD SMITH, MD JOHN       MD SMITH, JOHN

Any suggestions?

Upvotes: 0

Views: 521

Answers (1)

Shenglin Chen

Reputation: 4554

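A regex can resolve this. Here \b is a word boundary and \S+ is a run of non-space characters, so (\b\S+\b) captures a word; (.*) captures whatever follows it; \b\1\b is a later repeat of the captured word, which is the part to delete. The boundaries around \1 keep a single-letter word such as T from matching inside SMITH. The replacement \1\2 keeps the first occurrence and the text after it, and compbl collapses the double blank left behind when a repeat is removed from the middle of the string.

data z;
input name $50.;
new_name=compbl(strip(prxchange('s/(\b\S+\b)(.*)\b\1\b/\1\2/',-1,strip(name))));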
datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;
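
As an alternative sketch that stays closer to the loop in the question (it is not part of the original answer), the substring search FIND can be swapped for the whole-word search FINDW, so that a single-letter initial such as N is no longer matched inside JOHN. The dataset names `names` and `want2` are hypothetical, and treating a blank as the only word delimiter is an assumption; under that assumption `SMITH,` and `SMITH` count as different words.

```sas
/* Sample data from the question; the dataset name "names" is hypothetical */
data names;
   input pvd_name_orig $50.;
   datalines;
MD SMITH, JOHN MD
SMITH, JOHN W
MD T SMITH, JOHN W.
SMITH, JOHN WILLIAM
JOHN N MD SMITH MD
MD JOHN W. SMITH MD
MD SMITH, MD JOHN
;
run;

data want2 (keep=pvd_name_orig pvd_name_temp);
   set names;
   length pvd_name_temp word $50;
   /* start with the first blank-delimited word */
   pvd_name_temp = scan(pvd_name_orig, 1, ' ');
   do i = 2 to countw(pvd_name_orig, ' ');
      word = scan(pvd_name_orig, i, ' ');
      /* FINDW looks for whole words only; 'i' ignores case.      */
      /* STRIP removes the padding blanks so they are not treated */
      /* as part of the word being searched for.                  */
      if findw(strip(pvd_name_temp), strip(word), ' ', 'i') = 0 then
         pvd_name_temp = catx(' ', pvd_name_temp, word);
   end;
run;
```

Because each word is compared against the words already kept rather than against substrings of them, middle initials should survive without first appending a period.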

Upvotes: 1
