Mihai Hangiu

Reputation: 598

Python backreference regex

I need to search something like this:

lines = """package p_dio_bfm is
   procedure setBFMCmd (  
      variable  pin : in tBFMCmd
      );
end p_dio_bfm; -- end package;

package body p_dio_bfm is
   procedure setBFMCmd (  
      variable  pin : in tBFMCmd
      ) is
   begin
      bfm_cmd := pin;
   end setBFMCmd;
end p_dio_bfm;"""

I need to extract the package name, i.e. p_dio_bfm, and the package declaration, i.e. the part between "package p_dio_bfm is" and the FIRST "end p_dio_bfm;".

The problem is that the package declaration may end with either "end p_dio_bfm;" or "end package;". So I tried the following "OR" regex, which works for packages ending with "end package;" but does not work for packages ending with "end pck_name;":

pattern = re.compile("package\s+(\w+)\s+is(.*)end\s+(package|\1)\s*;")
match = pattern.search(lines)

The problem is the (package|\1) part of the regex, where I want to catch either the word "package" or the matched package name.

UPDATE: I have added complete code that I hope will clarify it:

import re
lines1 = """package p_dio_bfm is
   procedure setBFMCmd (
      variable  pin : in tBFMCmd
      );
end p_dio_bfm;

package body p_dio_bfm is
   procedure setBFMCmd (
      variable  pin : in tBFMCmd
      ) is
   begin
      bfm_cmd := pin;
   end setBFMCmd;
end p_dio_bfm;"""

lines2 = """package p_dio_bfm is
   procedure setBFMCmd (
      variable  pin : in tBFMCmd
      );
end package;

package body p_dio_bfm is
   procedure setBFMCmd (
      variable  pin : in tBFMCmd
      ) is
   begin
      bfm_cmd := pin;
   end setBFMCmd;
end package;"""

lines1 = lines1.replace('\n', ' ')
print lines1

pattern = re.compile("package\s+(\w+)\s+is(.*)end\s+(package|\1)\s*;")
match = pattern.search(lines1)

print match

lines2 = lines2.replace('\n', ' ')
print lines2

match = pattern.search(lines2)

print match

In both cases I expect to get back this part, using a single regex:

"""procedure setBFMCmd (
          variable  pin : in tBFMCmd
          );"""  

without the \n chars which I have removed.

Upvotes: 4

Views: 193

Answers (2)

bufh

Reputation: 3410

How about:

>>> for row in re.findall(
...   r'package(?:\s.*?)(?P<needle>[^\s]+)\s+is\s+(.*?)end\s+(?:package|(?P=needle));',
...   lines,
...   re.S
... ):
...   print '{{{', row[1], '}}}'
...
{{{ procedure setBFMCmd (
      variable  pin : in tBFMCmd
      );
}}}
{{{ procedure setBFMCmd (
      variable  pin : in tBFMCmd
      ) is
   begin
      bfm_cmd := pin;
   end setBFMCmd;
}}}

I took the liberty of not filtering exactly as @mihai-hangiu asked: the result also includes the second block.
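If only the package declaration is wanted, one possible variation is to skip "package body" with a negative lookahead and to stop at the first match with re.search. The sketch below is an illustration, not the pattern from this answer: the group names name and decl and the helper first_declaration are made up for the example, and the sample text is the first variant from the question.

```python
import re

# Sample input in the style of the question (declaration closed by the package name).
source = """package p_dio_bfm is
   procedure setBFMCmd (
      variable  pin : in tBFMCmd
      );
end p_dio_bfm;

package body p_dio_bfm is
   procedure setBFMCmd (
      variable  pin : in tBFMCmd
      ) is
   begin
      bfm_cmd := pin;
   end setBFMCmd;
end p_dio_bfm;"""

# (?!body\b) rejects "package body ..."; (?P=name) is a named backreference.
# re.DOTALL lets the lazy .*? run across newlines.
DECLARATION = re.compile(
    r"package\s+(?!body\b)(?P<name>\w+)\s+is\s+(?P<decl>.*?)"
    r"end\s+(?:package|(?P=name))\s*;",
    re.DOTALL,
)


def first_declaration(text):
    """Return (package name, declaration on one line), or None if nothing matches."""
    match = DECLARATION.search(text)
    if match is None:
        return None
    # Collapse newlines and runs of spaces into single spaces.
    return match.group("name"), " ".join(match.group("decl").split())


result = first_declaration(source)
print(result)
```

The same call should also work when the declaration is closed with "end package;" instead of the package name, since both alternatives are tried at each position.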

Upvotes: 2

Kasravnd

Reputation: 107297

Your regex doesn't match anything because, without the re.DOTALL flag, .* won't match newline characters. Instead, you can use [\s\S]*:

r'package ([^\s]+)\s+is([\s\S]*)end\s+(package|\1)\s*;'

See demo https://regex101.com/r/tZ3uH0/1
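A quick way to check the newline behaviour described above is to run the same tiny pattern with and without flags. The two-line input is made up for the illustration. Note that re.MULTILINE only changes how ^ and $ behave, so it does not help here.

```python
import re

# Hypothetical two-line input, just enough to put a newline between the anchors.
text = "is\nend"

no_flag = re.search(r"is.*end", text)                 # '.' stops at the newline
with_dotall = re.search(r"is.*end", text, re.DOTALL)  # '.' also matches '\n'
char_class = re.search(r"is[\s\S]*end", text)         # class matches any character
with_multiline = re.search(r"is.*end", text, re.MULTILINE)  # affects only ^ and $

print(no_flag, with_multiline)
print(with_dotall.group(0) == char_class.group(0))
```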

There are other problems here, though. One is that your string contains two package blocks. Also, a more elegant and efficient approach is to use the re.DOTALL flag, which makes the '.' special character match any character at all, including a newline. So you can write your regex like this:

pattern = re.compile("package\s+(\w+)\s+is(.*)end\s+(package|\1)\s*;",re.DOTALL)

But this will still match only the first block:

>>> match = pattern.search(lines)
>>> print match.group(0)
package p_dio_bfm is
   procedure setBFMCmd (  
      variable  pin : in tBFMCmd
      );
end p_dio_bfm; -- end package;
>>> print match.group(1)
p_dio_bfm
>>> print match.group(2)

   procedure setBFMCmd (  
      variable  pin : in tBFMCmd
      );
end p_dio_bfm; -- 
>>> print match.group(3)
package
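One detail that likely explains why group(3) above is "package" (taken from the trailing comment) rather than the package name: the pattern is written as an ordinary string literal, so Python converts "\1" into the single character chr(1) before the re module ever sees it, and the backreference is lost. A minimal check, using a shortened made-up input and a raw-string variant of the same pattern:

```python
import re

# In an ordinary string literal, "\1" is an octal escape: one character, chr(1).
assert "\1" == "\x01"
# In a raw string the backslash survives, so re sees a backreference to group 1.
assert r"\1" == "\\1"

# Shortened, hypothetical input closed by the package name.
text = "package p_dio_bfm is ... end p_dio_bfm;"

# "\\s" and "\\w" are escaped by hand; only "\1" is left as the octal escape.
broken = re.compile(
    "package\\s+(\\w+)\\s+is(.*)end\\s+(package|\1)\\s*;", re.DOTALL
)
# Same pattern as a raw string: \1 now refers back to the captured name.
fixed = re.compile(
    r"package\s+(\w+)\s+is(.*)end\s+(package|\1)\s*;", re.DOTALL
)

broken_match = broken.search(text)
fixed_match = fixed.search(text)
print(broken_match)
print(fixed_match.group(3))
```

With the raw string, the greedy (.*) becomes the remaining issue on input that contains several blocks, because it extends to the last possible terminator.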

To match all blocks, you need to allow for an optional word such as body before the package name, and make the second group non-greedy:

package\s+(?:\w+\s+?)?([^\s]+)\s+is(.*?)end\s+(package|\1)\s*;

See demo https://regex101.com/r/tZ3uH0/3
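The switch from (.*) to (.*?) in that last pattern matters as soon as the input holds more than one block. A small sketch on a shortened, made-up input, assuming the patterns are written as raw strings so that \1 is a real backreference:

```python
import re

# Shortened, hypothetical input: a declaration followed by a body.
text = "package p is decl; end p; package body p is impl; end p;"

# Greedy: (.*) runs to the end and backtracks to the LAST valid terminator.
greedy = re.search(
    r"package\s+(\w+)\s+is(.*)end\s+(?:package|\1)\s*;", text, re.DOTALL
)
# Lazy: (.*?) grows one character at a time and stops at the FIRST terminator.
lazy = re.search(
    r"package\s+(\w+)\s+is(.*?)end\s+(?:package|\1)\s*;", text, re.DOTALL
)

greedy_body = greedy.group(2).strip()
lazy_body = lazy.group(2).strip()
print(greedy_body)
print(lazy_body)
```

Here the greedy version captures everything up to the final "end p;", including the start of the body block, while the lazy version stops after the declaration.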

Upvotes: 3
