Reputation: 325
I am trying to remove groups of repeated words in SAS. Basically, I want to remove any phrase that recurs within a string. The forward slash is the delimiter. I am using SAS 9.4 and have the following example:
data have;
    string = 'Lymph node pain/Pain in extremity/Pain in extremity'; output;
    string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
    string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
run;
data test;
    set have;
    _1=prxparse('s/([A-Za-z].+?\s.*?\/.*?)(.*?)(\1+)/\2\3/i');
    _2=prxparse('/([A-Za-z].+?\s.*?\/.*?)(.*?)(\1+)/i');
    do i=1 to 10;
        string=prxchange(_1, -1, strip(string));
        if not prxmatch(_2, strip(string)) then leave;
    end;
    drop i;
run;
I tried the above regex and it works for 'Lymph node pain/Lymph node pain/Pain in extremity': the result is 'Lymph node pain/Pain in extremity'. However, it doesn't work for 'Lymph node pain/Pain in extremity/Pain in extremity' or 'Lymph node pain/Neuralgia/Neuralgia', and I'm not sure why.
Any help is appreciated.
Upvotes: 0
Views: 104
Reputation: 9569
Here's a scan-based approach. I've assumed that you have a maximum of 3 phrases per string, but this can easily be adjusted to work for any number of phrases if necessary.
data have;
    string = 'Lymph node pain/Pain in extremity/Pain in extremity'; output;
    string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
    string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
    string = 'Neuralgia/Lymph node pain/Neuralgia'; output; /*Added A/B/A example*/
run;
data test;
    set have;
    array phrases[3] $32;
    /*Separate string into an array of phrases delimited by / */
    do i = 1 to dim(phrases);
        phrases[i] = scan(string,i,'/');
    end;
    /*Sort the array so that duplicate phrases are next to each other*/
    call sortc(of phrases[*]);
    /*Iterate through the array and build up an output string of non-duplicates*/
    length outstring $255;
    do i = 1 to dim(phrases);
        if i = 1 then outstring = phrases[1];
        else if phrases[i] ne phrases[i-1] then outstring = catx('/',outstring,phrases[i]);
    end;
    keep string outstring;
run;
This has the side effect of sorting all the phrases into alphabetical order rather than order of first appearance in the string.
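If the order of first appearance matters, or the number of phrases per string is not fixed, one possible variant is to skip the sort and check each phrase against what has already been written to the output. The following is only a sketch along those lines, not a tested replacement for the code above: the helper variables phrase, seen and j, and the $64 phrase length, are my own assumptions and should be adjusted to your data.

```sas
data have;
    string = 'Lymph node pain/Pain in extremity/Pain in extremity'; output;
    string = 'Lymph node pain/Lymph node pain/Pain in extremity'; output;
    string = 'Lymph node pain/Neuralgia/Neuralgia'; output;
    string = 'Neuralgia/Lymph node pain/Neuralgia'; output;
run;

data test;
    set have;
    /* Lengths are assumptions; widen them if your phrases are longer */
    length outstring $255 phrase $64;
    outstring = '';
    /* countw with '/' as the only delimiter gives the number of phrases */
    do i = 1 to countw(string, '/');
        phrase = scan(string, i, '/');
        /* Check whether this phrase is already in the output string */
        seen = 0;
        do j = 1 to countw(outstring, '/');
            if scan(outstring, j, '/') = phrase then seen = 1;
        end;
        /* Append only phrases that have not been seen, keeping their original order */
        if not seen then outstring = catx('/', outstring, phrase);
    end;
    keep string outstring;
run;
```

The comparison is exact and case-sensitive, so 'Neuralgia' and 'neuralgia' would be treated as different phrases. For the A/B/A example this should keep 'Neuralgia/Lymph node pain' rather than reordering it alphabetically.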
Upvotes: 1