Joshua

Reputation: 269

python multiline regex

I'm having trouble writing the correct regular expression for a multiline match. Can someone point out what I'm doing wrong? I'm looping through a basic dhcpd.conf file with hundreds of entries such as:

host node20007
{
    hardware ethernet 00:22:38:8f:1f:43;
    fixed-address node20007.domain.com;
}

I've gotten various regexes to work for the MAC and the fixed-address, but I cannot combine them into one that matches properly.

import re

f = open('/etc/dhcp3/dhcpd.conf', 'r')
re_hostinfo = re.compile(r'(hardware ethernet (.*))\;(?:\n|\r|\r\n?)(.*)', re.MULTILINE)

for host in f:
    match = re_hostinfo.search(host)
    if match:
        print(match.groups())

Currently my match groups will look like:
('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', '')

But looking for something like:
('hardware ethernet 00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')

Upvotes: 7

Views: 14035

Answers (2)

ghostdog74

Reputation: 343067

Sometimes the easier approach is to skip regex altogether. For example:

for line in open("dhcpd.conf"):
    line = line.rstrip()
    sline = line.split()
    # Test each keyword separately: `"a" or "b" in line` is always true.
    if "hardware ethernet" in line or "fixed-address" in line:
        print(sline[-1])

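Another way:

data = open("file").read().split("}")
for item in data:
    # Keep the non-empty lines of each host block, stripped of whitespace.
    item = [i.strip() for i in item.split("\n") if i != '']
    for elem in item:
        if "hardware ethernet" in elem:
            print(elem.split()[-1])
    # The last remaining line of the block is the address entry.
    if item:
        print(item[-1])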

output

$ more file
host node20007
{
    hardware ethernet 00:22:38:8f:1f:43;
        fixed-address node20007.domain.com;
}

host node20008
{
    hardware ethernet 00:22:38:8f:1f:44;
        some-address node20008.domain.com;
}

$ python test.py
00:22:38:8f:1f:43;
fixed-address node20007.domain.com;
00:22:38:8f:1f:44;
some-address node20008.domain.com;

Upvotes: 0

John Machin

Reputation: 83032

Update: I've just noticed the real reason you are getting those results. In your code:

for host in f:
    match = re_hostinfo.search(host)
    if match:
        print match.groups()

host refers to a single line, but your pattern needs to work over two lines.
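A minimal sketch of that difference, assuming a sample string modelled on the entry in the question (the names `sample`, `per_line` and `whole` are illustrative only). It runs the question's pattern once per line and once over the whole text:

```python
import re

# The pattern from the question, unchanged.
re_hostinfo = re.compile(r'(hardware ethernet (.*))\;(?:\n|\r|\r\n?)(.*)', re.MULTILINE)

# Assumed sample text, modelled on the entry in the question.
sample = (
    "host node20007\n"
    "{\n"
    "    hardware ethernet 00:22:38:8f:1f:43;\n"
    "    fixed-address node20007.domain.com;\n"
    "}\n"
)

# One line at a time: each string stops at its newline,
# so the last group has nothing left to capture.
per_line = [m.groups()
            for m in map(re_hostinfo.search, sample.splitlines(True))
            if m]

# Whole text at once: the last group can reach into the following line.
whole = re_hostinfo.search(sample).groups()

print(per_line)
print(whole)
```

Searched per line, the third group comes back empty, as in the question. Searched over the whole text, it captures the entire following line, including the leading indentation and the trailing semicolon, so the pattern itself still needs tightening.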

Try this:

data = f.read()
for x in regex.finditer(data):
    process(x.groups())

where regex is a compiled pattern that matches over two lines.
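As a self-contained sketch of that loop: the `data` string below stands in for `f.read()`, the pattern is one possible two-line pattern that assumes the `fixed-address` line directly follows the MAC line, and `process` is a hypothetical placeholder for whatever you want to do with each match.

```python
import re

# Hypothetical sample standing in for f.read().
data = """\
host node20007
{
    hardware ethernet 00:22:38:8f:1f:43;
    fixed-address node20007.domain.com;
}

host node20008
{
    hardware ethernet 00:22:38:8f:1f:44;
    fixed-address node20008.domain.com;
}
"""

# Assumed pattern: MAC on one line, fixed-address on the next.
# \s+ matches the newline and indentation between the two lines.
regex = re.compile(r'hardware ethernet\s+(\S+);\s+fixed-address\s+(\S+);')

def process(groups):
    """Hypothetical stand-in: return (address, mac) for one match."""
    mac, address = groups
    return address, mac

results = [process(x.groups()) for x in regex.finditer(data)]
print(results)
```

Because `finditer` scans the whole string, every host entry is found in one pass, at the cost of holding the full file in memory.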

If your file is large, and you are sure that the pieces of interest are always spread over exactly two lines, you could read the file a line at a time: check each line for the first part of the pattern, and set a flag telling you whether the next line should be checked for the second part. If you are not sure, it gets complicated, maybe enough to start looking at the pyparsing module.
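The line-at-a-time idea could look roughly like this. It is only a sketch: `MAC_RE`, `ADDR_RE` and `scan_lines` are names made up for illustration, and it assumes the address line is always the line immediately after the MAC line.

```python
import re

# Assumed per-line patterns (illustrative, not from the original code).
MAC_RE = re.compile(r'hardware ethernet\s+(\S+);')
ADDR_RE = re.compile(r'fixed-address\s+(\S+);')

def scan_lines(lines):
    """Yield (mac, address) when an address line directly follows a MAC line."""
    pending_mac = None  # the "flag": holds the MAC found on the previous line
    for line in lines:
        if pending_mac is not None:
            m = ADDR_RE.search(line)
            if m:
                yield pending_mac, m.group(1)
        # Remember a MAC only for the very next line; otherwise clear the flag.
        m = MAC_RE.search(line)
        pending_mac = m.group(1) if m else None
```

Since the function accepts any iterable of lines, an open file object can be passed directly, e.g. `for mac, address in scan_lines(f): ...`, so the file is never loaded into memory all at once.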

Now back to the original answer, discussing the pattern that you should use:

You don't need MULTILINE; just match whitespace. Build up your pattern using these building blocks:

(1) fixed text
(2) one or more whitespace characters
(3) one or more non-whitespace characters

and then put in parentheses to get your groups.

Try this:

>>> m = re.search(r'(hardware ethernet\s+(\S+));\s+\S+\s+(\S+);', data)
>>> print(m.groups())
('hardware ethernet   00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')
>>>

Please consider using "verbose mode": it lets you document exactly which pieces of the pattern match which pieces of the data, and it often helps in getting the pattern right in the first place. Example:

>>> regex = re.compile(r"""
... (hardware[ ]ethernet \s+
...     (\S+) # MAC
... ) ;
... \s+ # includes newline
... \S+ # variable(??) text e.g. "fixed-address"
... \s+
... (\S+) # e.g. "node20007.domain.com"
... ;
... """, re.VERBOSE)
>>> print(regex.search(data).groups())
('hardware ethernet   00:22:38:8f:1f:43', '00:22:38:8f:1f:43', 'node20007.domain.com')
>>>

Upvotes: 13
