Reputation: 728
I am struggling to evaluate a SAS Perl regular expression to identify what is being replaced with what. I have gone through the SAS documentation on what each metacharacter represents, but could someone please help me identify what is being replaced with what in the expression below?
PRXPARSE('s/(^[0-9]+\s)|([#][0-9]+)|(\s[A-Z][0-9]+)|([`''\*\+\-\,\!"#])|([\.](?!BETA))|(DEALER[0-9]+)|(\s[0-9]{3,})|([0-9]+\s*$)//');
The above expression is part of a data step used in a SAS macro.
data &INPUT_DATA_SET (DROP=MATCH1 MATCH2 DEALER1 DEALER2);
set &INPUT_DATA_SET;
LENGTH DEALER1 $ 40. DEALER $ 40.;
if _N_ = 1 then MATCH1 = PRXPARSE('s/(^[0-9]+\s)|([#][0-9]+)|(\s[A-Z][0-9]+)|([`''\*\+\-\,\!"#])|([\.](?!COM))|(STORE[0-9]+)|(\s[0-9]{3,})|([0-9]+\s*$)//');
if _N_ = 1 then MATCH2 = PRXPARSE("s/\s+/ /");
RETAIN MATCH1;
RETAIN MATCH2;
call PRXCHANGE(MATCH1, -1, &DEALER_NAME_FIELD, DEALER1);
call PRXCHANGE(MATCH2, -1, DEALER1, DEALER);
run;
Could someone kindly explain how the string replacement happens in the first PRXPARSE expression?
Thanks in advance. Naga Vemprala
Upvotes: 1
Views: 1355
Reputation: 1576
I've restructured the regex for explanation purposes only. Do not replace your code with the version below: a regex split across multiple lines has to be concatenated, with leading and trailing blanks stripped from each piece, before it is passed to the PRXPARSE function.
PRXPARSE('s/(^[0-9]+\s)| /* A number (1 or more digits) at the start of the string is selected, together with the single whitespace character that follows it */
([#][0-9]+)| /* or a number (1 or more digits) preceded by a pound sign (#) is selected, together with the # */
(\s[A-Z][0-9]+)| /* or a whitespace character followed by a single uppercase letter and a number (1 or more digits) is selected, including the whitespace and the letter */
([`''\*\+\-\,\!"#])| /* or any single occurrence of one of the characters ` ' * + - , ! " # is selected (the doubled '' is how a single quote is written inside a single-quoted SAS string) */
([\.](?!COM))| /* or any period (.) that is not immediately followed by the string COM is selected */
(STORE[0-9]+)| /* or the string STORE followed by a number (1 or more digits) is selected */
(\s[0-9]{3,})| /* or a number of 3 or more digits (no upper limit) preceded by a whitespace character (space, tab, line break) is selected, together with that whitespace */
([0-9]+\s*$) /* or a number (1 or more digits) followed by 0 or more whitespace characters at the end of the string is selected */
//'); /* Every substring matching any of the above groups (alternatives are tried left to right at each position) is removed from the input value, because the replacement part is empty */
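If you do want to keep a commented, one-alternative-per-line layout in real code, one possible approach is to build the pattern with CATS, which strips leading and trailing blanks from each piece before joining them. This is only a sketch: the variable name PATTERN and the length of 200 are my own choices, not part of the original code, and the TRIM call is there on the assumption that trailing blanks from the padded variable should not reach PRXPARSE.

```sas
data _null_;
  length PATTERN $ 200;               /* hypothetical name and length */
  PATTERN = cats(
    's/',
    '(^[0-9]+\s)|',                   /* leading number plus one whitespace character */
    '([#][0-9]+)|',                   /* # followed by digits */
    '(\s[A-Z][0-9]+)|',               /* whitespace, one uppercase letter, digits */
    '([`''\*\+\-\,\!"#])|',           /* any one of the listed punctuation characters */
    '([\.](?!COM))|',                 /* period not immediately followed by COM */
    '(STORE[0-9]+)|',                 /* STORE followed by digits */
    '(\s[0-9]{3,})|',                 /* whitespace followed by 3 or more digits */
    '([0-9]+\s*$)',                   /* trailing number plus optional whitespace */
    '//'                              /* empty replacement: matches are deleted */
  );
  MATCH1 = prxparse(trim(PATTERN));   /* trim the padding added by the $200 variable */
  put PATTERN=;                       /* write the assembled pattern to the log */
run;
```

The assembled string written to the log should be the same single-line pattern used in the original data step, so the comments live in the SAS code rather than inside the regex itself.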
The regex needs to be written as 's/match/replacement/' for it to work with the PRXCHANGE function or the CALL PRXCHANGE routine. Also, because you're calling CALL PRXCHANGE with -1 as the second parameter, every match found in the variable is replaced; and since the replacement part of the regex is empty, every match is simply removed from the resulting variable.
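To see the effect of that second parameter in isolation, here is a minimal sketch using the PRXCHANGE function on a made-up value. The variable names and the sample string 'A1B22C333' are purely illustrative.

```sas
data _null_;
  length FIRST_ONLY ALL_MATCHES $ 20;   /* hypothetical variable names */
  /* times = 1: only the first run of digits is removed */
  FIRST_ONLY  = prxchange('s/[0-9]+//', 1, 'A1B22C333');
  /* times = -1: every run of digits is removed */
  ALL_MATCHES = prxchange('s/[0-9]+//', -1, 'A1B22C333');
  put FIRST_ONLY= ALL_MATCHES=;         /* FIRST_ONLY=AB22C333 ALL_MATCHES=ABC */
run;
```

The same rule applies to your MATCH1 pattern: with -1, every substring that matches any of the eight alternatives is deleted, not just the first one.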
I would suggest using an online regex testing tool, e.g. http://www.regexr.com/v1/, to build and validate a regex before running it in SAS.
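If you would rather check the behaviour inside SAS itself, a small throwaway data step along these lines can help. The data set name DEMO, the variable NAME and the three sample values are invented for illustration; the two patterns are the ones from the question.

```sas
data DEMO;                               /* hypothetical test data set */
  length NAME DEALER1 DEALER $ 40;
  retain MATCH1 MATCH2;
  if _N_ = 1 then do;
    /* same patterns as in the question, compiled once */
    MATCH1 = prxparse('s/(^[0-9]+\s)|([#][0-9]+)|(\s[A-Z][0-9]+)|([`''\*\+\-\,\!"#])|([\.](?!COM))|(STORE[0-9]+)|(\s[0-9]{3,})|([0-9]+\s*$)//');
    MATCH2 = prxparse('s/\s+/ /');
  end;
  infile datalines truncover;
  input NAME $char40.;                   /* read the whole line, blanks included */
  call prxchange(MATCH1, -1, NAME, DEALER1);    /* delete every match */
  call prxchange(MATCH2, -1, DEALER1, DEALER);  /* collapse runs of whitespace */
  put NAME= DEALER=;                     /* compare before and after in the log */
  drop MATCH1 MATCH2 DEALER1;
  datalines;
12 ACME MOTORS #45
STORE77 PARTS.COM
J. SMITH AUTO 1234
;
run;
```

Going by the pattern, the first value should lose the leading "12 " and the "#45", the second should lose "STORE77" but keep the period because it is followed by COM, and the third should lose the period and the " 1234". Note that MATCH2 only collapses whitespace; it does not strip a leading blank left behind by the first step.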
Hope this helps
Upvotes: 3