KH_

Reputation: 325

Matching and removing recurring group of words using REGEX in SAS

I am trying to remove groups of repeating words in SAS; that is, I want to drop any phrase that recurs within a string. The forward slash is the delimiter. I am using SAS 9.4 and have the following example:

I tried the regex below, and it works for 'Lymph node pain/Lymph node pain/Pain in extremity': the result is 'Lymph node pain/Pain in extremity'. However, it doesn't work for 'Lymph node pain/Pain in extremity/Pain in extremity' or 'Lymph node pain/Neuralgia/Neuralgia', and I'm not sure why.

data have;
  string = 'Lymph node pain/Pain in extremity/Pain in extremity'; output;
  string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
  string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
run;

data test;
  set have;
  _1 = prxparse('s/([A-Za-z].+?\s.*?\/.*?)(.*?)(\1+)/\2\3/i');
  _2 = prxparse('/([A-Za-z].+?\s.*?\/.*?)(.*?)(\1+)/i');
  do i = 1 to 10;
    string = prxchange(_1, -1, strip(string));
    if not prxmatch(_2, strip(string)) then leave;
  end;
  drop i;
run;

Any help is appreciated.

Upvotes: 0

Views: 104

Answers (1)

user667489

Reputation: 9569

Here's a scan-based approach. I've assumed a maximum of 3 phrases per string, but this can easily be adjusted for any number of phrases by increasing the array dimension (and the $32 phrase length, if your phrases are longer).

data have;
  string = 'Lymph node pain/Pain in extremity/Pain in extremity'; output;
  string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
  string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
  string = 'Neuralgia/Lymph node pain/Neuralgia'; output;  /*Added A/B/A example*/
run;

data test;
  set have;
  array phrases[3] $32;
  /*Separate string into an array of phrases delimited by / */
  do i = 1 to dim(phrases);
    phrases[i] = scan(string,i,'/');
  end;
  /*Sort the array so that duplicate phrases are next to each other*/
  call sortc(of phrases[*]);
  /*Iterate through the array and build up an output string of non-duplicates*/
  length outstring $255;
  do i = 1 to dim(phrases);
    if i = 1 then outstring = phrases[1];
    else if phrases[i] ne phrases[i-1] then outstring = catx('/',outstring,phrases[i]);
  end;
  keep string outstring;
run;

This has the side effect of sorting all the phrases into alphabetical order rather than order of first appearance in the string.
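If order of first appearance matters, one possible variation is to skip the sort and check each phrase against the output built so far. The following is only a sketch under a few assumptions: the dataset name test_ordered, the variable phrase and its $64 length are made up for illustration, and matching is case-insensitive (like the /i flag in the question's regex). It uses COUNTW to find the number of phrases, so no fixed array size is needed.

```sas
data test_ordered;  /* hypothetical output dataset name */
  set have;
  length outstring $255 phrase $64;  /* lengths are assumptions; adjust to your data */
  /* COUNTW with '/' as the only delimiter returns the number of phrases */
  do i = 1 to countw(string, '/');
    phrase = scan(string, i, '/');
    /* FINDW returns 0 when phrase is not yet a complete '/'-delimited item in outstring.
       Modifier i ignores case; modifier t trims trailing blanks from the arguments. */
    if findw(outstring, phrase, '/', 'it') = 0 then
      outstring = catx('/', outstring, phrase);
  end;
  keep string outstring;
run;
```

With the A/B/A example above, 'Neuralgia/Lymph node pain/Neuralgia', this should keep the phrases in their original order and give 'Neuralgia/Lymph node pain' rather than the alphabetically sorted result.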

Upvotes: 1
