Westworld

Reputation: 300

Using Regex to extract multi-line SAS code

I am trying to use Python to go through many thousands of lines of SAS code. I want to extract certain parts of the code so they can be printed or passed to another function.

The SAS code I am looking at might look like this:

"""%macro msg (name= some_macro) ;
%put Hello World, my name is &name ;
 %mend ;"""

I want to capture what is between the first and the last line, i.e. between the %macro line and the %mend ; line, so "%put Hello World, my name is &name ;" would be returned as a group.

I can achieve this capture with:

re.compile(r"\%macro\s*?.*?\s*?\((.*)\)\s*?;\n(.*?)\n\s*\%mend\s*;")

This works because (.*?)\n matches the single line I want.

NOTE: I am using a lot of \s* because I see whitespace all over the SAS code which seems to be pretty random.

However, when the macro body spans more lines (two or many more), my pattern no longer matches. For example:

"""%macro msg (name= some_macro) ;
%put Hello World, my name is &name ;
%let something happen
%do something else
%mend ;"""

Here I want to return "%put Hello World, my name is &name ; %let something happen %do something else" all as one group. I have tried adding the quantifiers * and +, but I do not know how to make the quantifier apply to the whole line pattern rather than just to the last character it sits next to. Here is one attempt:

r"\%macro\s*?.*?\s*?\((.*)\)\s*?;\n(.*?)\n+?\s*\%mend\s*;"

Here I am trying to indicate the line (.*?)\n could be repeated between 1 and unlimited times, and that I want to capture that group.

I have also tried re.MULTILINE and re.DOTALL, with ^ and $ as line anchors and dots for the line-end characters, but that didn't achieve the desired result either.

Please help me understand this area better. Thanks

Upvotes: 1

Views: 187

Answers (1)

The fourth bird

Reputation: 163342

You could use a single capture group and match the lines that do not start with %mend.

The percent sign does not need escaping. Also note that \s can match a newline, which may not be what you intend when you scatter \s* through the pattern.

%macro.*\r?\n((?:(?!\s*%mend).*\r?\n)+)\s*%mend ;

Explanation

  • %macro.*\r?\n Match %macro followed by the rest of the line and a newline
  • ( Capture group 1
    • (?: Non capturing group
      • (?!\s*%mend) Negative lookahead, assert that what is directly to the right is not optional whitespace followed by %mend
      • .*\r?\n Match the whole line and a newline
    • )+ Close non capturing group and repeat 1+ times to match at least a single line
  • ) Close capture group 1
  • \s*%mend ; Match optional whitespace, then %mend followed by a single space and a semicolon
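
As a minimal sketch of the breakdown above, the pattern can be run against the multi-line sample from the question. The variable names (sas_code, body_lines) are only illustrative. The point is that the + applies to the whole (?:...) group, so it repeats "one full line" rather than a single character.

```python
import re

# Sample input taken from the question (the multi-line macro body).
sas_code = """%macro msg (name= some_macro) ;
%put Hello World, my name is &name ;
%let something happen
%do something else
%mend ;"""

pattern = re.compile(r"%macro.*\r?\n((?:(?!\s*%mend).*\r?\n)+)\s*%mend ;")

match = pattern.search(sas_code)

# Group 1 holds every line between the %macro line and the %mend line,
# including the trailing newline of each captured line.
body = match.group(1)
body_lines = body.splitlines()
print(body_lines)

# \s is not limited to spaces and tabs: it also matches a newline.
print(re.fullmatch(r"\s", "\n") is not None)
```

The negative lookahead is checked at the start of every line, so the repetition stops as soon as the next line begins with optional whitespace followed by %mend.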


For example

import re
test_str = "%macro msg (name= some_macro) ;\n%put Hello World, my name is &name ;\n%mend ;"
pattern = re.compile(r"%macro.*\r?\n((?:(?!\s*%mend).*\r?\n)+)\s*%mend ;")
print(pattern.findall(test_str))
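
Since the question mentions that whitespace in the SAS code is fairly random and that the body should come back as one string, here is a possible variation, offered as a sketch rather than part of the pattern above. It assumes that %mend may be followed by any amount of whitespace before the semicolon (the pattern above expects exactly one space), and the helper name macro_bodies is hypothetical.

```python
import re

# Hypothetical variant of the pattern above: \s* before the semicolon
# tolerates "%mend;" and "%mend   ;" as well as "%mend ;".
MACRO_BODY = re.compile(r"%macro.*\r?\n((?:(?!\s*%mend).*\r?\n)+)\s*%mend\s*;")


def macro_bodies(sas_source):
    """Return each macro body as one string, with lines joined by a space."""
    bodies = []
    for match in MACRO_BODY.finditer(sas_source):
        # Strip the random indentation and drop lines that are empty.
        lines = (line.strip() for line in match.group(1).splitlines())
        bodies.append(" ".join(line for line in lines if line))
    return bodies


sample = """%macro msg (name= some_macro) ;
%put Hello World, my name is &name ;
%let something happen
  %do something else
   %mend;
"""
print(macro_bodies(sample))
```

Using finditer keeps the sketch usable on a file with many macros, because each %macro ... %mend pair yields its own entry in the returned list.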

Upvotes: 1
