How come my regex grouping is not grouping correctly?

Question

Example of a line in the file: "CIS 14A Visual Basic .NET Programming I x x x x"

I'm trying to split each line of a file into three groups: group(0) should be the course number (14A), group(1) should be the topic (Visual Basic .NET Programming I), and group(2) should be the quarters the course is available in. However, when I tested the code, group(0) matched the whole line, group(1) was the course number, group(2) was empty, and group(3) was a combination of the topic and the quarters available. I can't find what's wrong with it: each set of parentheses creates a group, but all the groups are in the wrong order, and "CIS", which I have not included in any parentheses, was included in group(0) for some reason. I'm new to regex, so any advice on how to fix my code would be much appreciated.

    with open(filename) as infile:
        for line in infile:
            self._match = (re.search('^CIS\s(\d*\w*)(\w*)\s?[^x]*(.*)$', line, re.I))
            self._numb = self._match.group(0).strip()
            self._name = self._match.group(1).strip()
            self._quarter=self._match.group(2).strip().split('x')

Wiktor Stribiżew · Accepted Answer

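Note that there are always as many .group()s as there are capturing groups + 1, because the zeroth group is reserved for the entire match. Capturing groups are therefore numbered from 1, which is why group(0) contained the whole line, including "CIS".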
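To illustrate the numbering, here is a minimal sketch. The pattern and the input string are made up for this example and are not taken from the question's file.

```python
import re

# Hypothetical two-group pattern, used only to show how groups are numbered.
m = re.search(r'(\d+)-(\w+)', 'id 42-abc')

whole = m.group(0)       # the entire match, regardless of parentheses
first = m.group(1)       # first capturing group
second = m.group(2)      # second capturing group
count = len(m.groups())  # number of capturing groups; group 0 is not counted

print(whole, first, second, count)
```

m.groups() returns only the capturing groups, so it can be a convenient way to avoid off-by-one mistakes with group(0).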

The regex you may use is

^CIS\s+([0-9A-Z]+)\s(.*?)\s(x\s.*)


See Python snippet:

with open(filename, 'r') as infile:
    for line in infile:
        self._match = re.search(r'^CIS\s+([0-9A-Z]+)\s(.*?)\s(x\s.*)', line, re.I)
        if self._match:
            self._numb = self._match.group(1).strip()
            self._name = self._match.group(2).strip()
            self._quarter = self._match.group(3).strip().split('x')

Regex details

^ - Start of string
CIS - a literal substring
\s+ - 1+ whitespaces
([0-9A-Z]+) - Group 1: one or more digits or uppercase letters (lowercase letters match too when re.I is used)
\s - a whitespace
(.*?) - Group 2: any 0 or more chars other than line break chars as few as possible
\s - whitespace
(x\s.*) - Group 3: x, whitespace and any 0 or more chars other than line break chars as many as possible.
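The breakdown above can be checked against the sample line from the question. This is a sketch only: the name COURSE_RE and the use of re.X (verbose mode, so the pattern can carry comments) are choices made for this example, not part of the original answer.

```python
import re

# Same pattern as in the answer, written in verbose mode (re.X) so each
# part can be commented. COURSE_RE is a name chosen for this example.
COURSE_RE = re.compile(r"""
    ^CIS\s+          # literal prefix, then one or more whitespace chars
    ([0-9A-Z]+)\s    # group 1: course number, then one whitespace char
    (.*?)\s          # group 2: topic, as few chars as possible
    (x\s.*)          # group 3: first x followed by whitespace, then the rest
""", re.I | re.X)

line = "CIS 14A Visual Basic .NET Programming I x x x x"
m = COURSE_RE.search(line)

if m:
    number, topic, quarters = m.groups()
    print(number)    # 14A
    print(topic)     # Visual Basic .NET Programming I
    print(quarters)  # x x x x
```

The lazy (.*?) stops at the first place where a whitespace character is followed by x and another whitespace character, which is what keeps the topic and the quarters in separate groups.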

