Regex that grabs variable number of groups

Question

This is not a question about how to use re.findall(), the global modifier (?g), or \g. It asks how to match n groups with a single regex, where n is between 3 and 5.

Rules:

- Ignore lines whose first non-space character is # (comments).
- Capture at least three items, always in this order: ITEM1, ITEM2, ITEM3
  - class ITEM1(stuff)
  - model = ITEM2
  - fields = (ITEM3)
- Capture any of the following if they exist (order UNKNOWN, and either can be missing):
  - write_once_fields = (ITEM4)
  - required_fields = (ITEM5)
- Know which match is which: either retrieve the matches in a fixed order, returning None where there is no match, or retrieve key/value pairs.

My question is whether this is doable, and if so, how?

I've gotten this far, but it doesn't yet handle comments, unknown order, or missing items, and it doesn't stop searching when it reaches the next class definition: https://www.regex101.com/r/cG5nV9/8

(?s)\nclass\s(.*?)(?=$)
.*?
model\s=\s(.*?)\n
.*?
(?=fields.*?\((.*?)$)
.*?
(?=write_once_fields.*?$(.*?)$)
.*?
(?=required_fields.*?$(.*?)$)

Do I need a conditional?

Thanks for any kinds of hints.

Adam Smith · Accepted Answer

I'd do something like:

from collections import defaultdict
import re

comment_line = re.compile(r"\s*#")
matches = defaultdict(dict)

with open('path/to/file.txt') as inf:
    d = {}  # catches and discards any matching lines
            # that appear before the first class
    for line in inf:
        if comment_line.match(line):
            continue  # skip comment lines
        stripped = line.strip()  # the attribute lines may be indented
        if line.startswith('class '):
            # "class Foo(stuff):" -> "Foo"
            classname = line.split()[1].split('(')[0].rstrip(':')
            d = matches[classname]
        elif stripped.startswith('model'):
            d['model'] = stripped.split('=', 1)[1].strip()
        elif stripped.startswith('fields'):
            d['fields'] = stripped.split('=', 1)[1].strip()
        elif stripped.startswith('write_once_fields'):
            d['write_once_fields'] = stripped.split('=', 1)[1].strip()
        elif stripped.startswith('required_fields'):
            d['required_fields'] = stripped.split('=', 1)[1].strip()
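
Since the question asks to know which match is which, the per-class dictionaries built above can be read back in a fixed order with dict.get(), which gives None for anything that was missing. A minimal sketch of that idea; FIELD_ORDER, ordered_items, and the sample data are hypothetical names and values made up for illustration, not part of the original code:

```python
# Hypothetical helper: read one class's entries back in a fixed order.
FIELD_ORDER = ("model", "fields", "write_once_fields", "required_fields")


def ordered_items(matches, classname):
    """Return (classname, model, fields, write_once_fields, required_fields),
    with None in place of any key that was not found."""
    found = matches.get(classname, {})
    return (classname,) + tuple(found.get(key) for key in FIELD_ORDER)


# Made-up sample in the shape the parsing loop produces:
# class name -> {key: raw text to the right of the '='}.
sample = {
    "Foo": {"model": "Bar", "fields": "('a', 'b')", "required_fields": "('a',)"},
}

print(ordered_items(sample, "Foo"))
```

Here write_once_fields is absent from the sample, so its slot in the tuple is None while the other positions stay fixed.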

You could probably do the parsing more easily with regex matching:

comment_line = re.compile(r"\s*#")
# the group names come from the .group() calls below;
# \w+ is an assumed pattern for the class name
class_line = re.compile(r"class (?P<classname>\w+)")
possible_keys = ["model", "fields", "write_once_fields", "required_fields"]
data_line = re.compile(r"\s*(?P<key>" + "|".join(possible_keys) +
                       r")\s+=\s+(?P<value>.*)")

with open('path/to/file.txt') as inf:  # same file as above
    d = {}  # default catcher as above
    for line in inf:
        if comment_line.match(line):
            continue
        class_match = class_line.match(line)
        if class_match:
            d = matches[class_match.group('classname')]
            continue  # there won't be more than one match per line
        data_match = data_line.match(line)
        if data_match:
            key, value = data_match.group('key'), data_match.group('value')
            d[key] = value

But this might be harder to understand. YMMV.
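
For anyone who wants to try the regex version without a file on disk, here is a self-contained sketch that runs the same line-by-line idea over an in-memory string. The function name parse_classes, the constant names, and the SAMPLE text are all made up for illustration, and I've assumed \s*=\s* (optional whitespace around the equals sign) rather than the stricter \s+=\s+ above:

```python
import re
from collections import defaultdict

COMMENT_LINE = re.compile(r"\s*#")
CLASS_LINE = re.compile(r"class\s+(?P<classname>\w+)")
KEYS = ("model", "fields", "write_once_fields", "required_fields")
DATA_LINE = re.compile(
    r"\s*(?P<key>" + "|".join(KEYS) + r")\s*=\s*(?P<value>.*)"
)


def parse_classes(text):
    """Map each class name to a dict of the keys found in its body."""
    matches = defaultdict(dict)
    current = {}  # throwaway dict for data lines before the first class
    for line in text.splitlines():
        if COMMENT_LINE.match(line):
            continue  # first non-space character is '#'
        class_match = CLASS_LINE.match(line)
        if class_match:
            # a new unindented class line starts a new bucket
            current = matches[class_match.group("classname")]
            continue
        data_match = DATA_LINE.match(line)
        if data_match:
            current[data_match.group("key")] = data_match.group("value").strip()
    return {name: dict(found) for name, found in matches.items()}


# Hypothetical input, loosely following the shape described in the question.
SAMPLE = """\
class Foo(Base):
    # model = Ignored
    model = Bar
    fields = ('a', 'b')
    required_fields = ('a',)

class Baz(Base):
    model = Qux
    fields = ('c',)
    write_once_fields = ('c',)
"""

result = parse_classes(SAMPLE)
for name, found in result.items():
    print(name, found)
```

Because CLASS_LINE is applied with match(), only an unindented class line starts a new bucket, so an indented nested class would not reset the current one. The optional keys can appear in any order or not at all; they just end up as present or absent dictionary entries.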
