NuclearPeon

Reputation: 6059

Capture a Repeating Group in Python using RegEx (see example)

I am writing a regular expression in Python to capture the contents of an SSI tag.

I want to parse the tag:

<!--#include file="/var/www/localhost/index.html" set="one" -->

into the following components:

The problem is that I am at a loss on how to grab these repeating groups, as name/value pairs may occur one or more times in a tag. I have spent hours on this.

Here is my current regex string:

^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$

It captures the include in the first group and file="/var/www/localhost/index.html" set="one" in the second group, but what I am after is this:

group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"

(continue for every other name="value" pair)


I am using this site to develop my regex

Upvotes: 3

Views: 6184

Answers (5)

astromancer

Reputation: 611

The third-party regex library keeps every capture of a repeated group, whereas the built-in re module keeps only the last one. This allows for a simple solution without needing an external for loop to parse the groups afterwards.

import regex

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = regex.compile(
    r'<!--#(?<fun>[a-z]+)(\s+(?<key>[a-z]+)\s*=\s*"(?<val>[^"]*)")+')

match = rgx.match(string)
keys, values = match.captures('key', 'val')
print(match['fun'], *map(' = '.join, zip(keys, values)), sep='\n  ')

This prints what you're after:

include
  file = /var/www/localhost/index.html
  set = one
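
For comparison, here is a minimal sketch of what the built-in re module does with the same pattern, rewritten with re's (?P<name>...) syntax. It is only an illustration of the limitation mentioned above, and the variable name last_pair is made up for this example.

```python
import re

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
# Same pattern as above, using re's (?P<name>...) syntax for named groups.
rgx = re.compile(
    r'<!--#(?P<fun>[a-z]+)(\s+(?P<key>[a-z]+)\s*=\s*"(?P<val>[^"]*)")+')

match = rgx.match(string)
# The group repeated twice, but re only remembers the final iteration.
last_pair = (match.group('key'), match.group('val'))
print(match.group('fun'), last_pair)
```

Only the set="one" pair survives; the file pair is overwritten by the second iteration, which is why the re-based answers fall back on a second pass over the attributes.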

Upvotes: 0

Casimir et Hippolyte

Reputation: 89639

One way, using the third-party regex module:

#!/usr/bin/python

import regex

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    (?>
        \G(?<!^)
      |
        <!-- \# (?<function> [a-z]+ )
    )
    \s+
    (?<key> [a-z]+ ) \s* = \s* " (?<val> [^"]* ) "
'''

matches = regex.finditer(p, s)

for m in matches:
    if m.group("function"):
        print ("function: " + m.group("function"))
    print (" key:   " + m.group("key") + "\n value: " + m.group("val") + "\n")

A way with the built-in re module:

#!/usr/bin/python

import re

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    <!-- \# (?P<function> [a-z]+ )
    \s+
    (?P<params> (?: [a-z]+ \s* = \s* " [^"]* " \s*? )+ )
    -->
'''

matches = re.finditer(p, s)

for m in matches:
    print ("function: " + m.group("function"))
    for param in re.finditer(r'[a-z]+|"([^"]*)"', m.group("params")):
        if param.group(1) is not None:  # an empty value "" still counts as a value
            print (" value: " + param.group(1) + "\n")
        else:
            print (" key:   " + param.group())

Upvotes: 3

Chrispresso

Reputation: 4131

Unfortunately, Python's built-in re module does not support recursive regular expressions.
You can do this instead:

import re

string = '''<!--#include file="/var/www/localhost/index.html" set="one" set2="two" -->'''
# Raw string so the backslashes reach the regex engine unchanged.
# The closing quote sits outside the value group, so it is not captured.
regexString = r'''<!--#(?P<tag>\w+)\s(?P<name>\w+)="(?P<value>.*?)"\s(?P<keyVal>.*)\s-->'''
regex = re.compile(regexString)
match = regex.match(string)
tag = match.group('tag')
name = match.group('name')
value = match.group('value')
keyVal = match.group('keyVal').split()  # remaining pairs, split on whitespace
for item in keyVal:
    key, val = item.split('=')  # val still carries its surrounding quotes
    # You can now do whatever you want with the key=val pair

Upvotes: 0

Adam Smith

Reputation: 54243

Grab everything that can be repeated, then parse them individually. This is probably a good use case for named groups, as well!

import re

data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''

result = re.match(pat, data)
print(result.groups())
# ('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')

Then iterate through it:

g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split(): # split on whitespace
    key, value = keyvalue.split('=')
    # do something with them
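
Note that splitting on whitespace and '=' leaves the quotes on each value and breaks if a value itself contains a space or an equals sign. One possible alternative, sketched here with a hypothetical attrs string standing in for g4, is to let re.findall pull out the pairs:

```python
import re

# Hypothetical stand-in for g4; the second value contains spaces and an '='.
attrs = 'set="one" reset="a = b"'

# Each match yields a (name, value) tuple; the quotes stay outside the groups.
pairs = re.findall(r'([a-z]+)="([^"]*)"', attrs)
for key, value in pairs:
    print(key, value)
```

This assumes values never contain an escaped double quote; if they can, the value part of the pattern would need to be extended.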

Upvotes: 3

aliteralmind

Reputation: 20163

I recommend against using a single regular expression to capture every item in a repeating group. Unfortunately I don't know Python, so I'm answering in Java, the language I understand. I recommend first extracting all the attributes as one string, and then looping through each name/value pair, like this:

import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class AllAttributesInTagWithRegexLoop  {
   public static final void main(String[] ignored)  {
      String input = "<!--#include file=\"/var/www/localhost/index.html\" set=\"one\" -->";

      Matcher m = Pattern.compile(
         "<!--#(include|echo|set) +(.*)-->").matcher(input);

      m.matches();

      String tagFunc = m.group(1);
      String allAttrs = m.group(2);

      System.out.println("Tag function: " + tagFunc);
      System.out.println("All attributes: " + allAttrs);

      m = Pattern.compile("(\\w+)=\"([^\"]+)\"").matcher(allAttrs);
      while(m.find())  {
         System.out.println("name=\"" + m.group(1) + 
            "\", value=\"" + m.group(2) + "\"");
      }
   }
}

Output:

Tag function: include
All attributes: file="/var/www/localhost/index.html" set="one"
name="file", value="/var/www/localhost/index.html"
name="set", value="one"
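
For readers working in Python, roughly the same two-step approach might look like the sketch below. It is a loose translation of the Java code above with a None check added for tags that do not match; the function name parse_ssi_tag is made up for this example.

```python
import re

def parse_ssi_tag(tag):
    """Return (function, [(name, value), ...]), or None if tag does not match."""
    # Step 1: grab the tag function and everything up to the closing -->.
    # fullmatch mirrors Java's Matcher.matches(), which must match the whole input.
    outer = re.fullmatch(r'<!--#(include|echo|set) +(.*)-->', tag)
    if outer is None:
        return None
    tag_func, all_attrs = outer.groups()
    # Step 2: loop over each name="value" pair inside the attribute string.
    attrs = [(m.group(1), m.group(2))
             for m in re.finditer(r'(\w+)="([^"]+)"', all_attrs)]
    return tag_func, attrs

tag = '<!--#include file="/var/www/localhost/index.html" set="one" -->'
result = parse_ssi_tag(tag)
print(result)
```

Returning a list of pairs rather than a dict keeps the attribute order and allows a name to appear more than once.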

Here's an answer that may be of interest: https://stackoverflow.com/a/23062553/2736496


Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.

Upvotes: 1
