Matt

Reputation: 826

How to write a regex for a text including comma separated values in python?

I'm trying to write a regex in python to get F1 to F8 fields from a line that looks like this:

LineNumber(digits): F1, F2, F3, ..., F8;

F1 to F8 can have lowercase/uppercase letters and hyphens.

For example:

Header
Description
21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;
Footer

What I've tried so far is matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line), which matches lines in the above format. However, when I call matched.groups() to print the matched fields, I only get F7 and F8, while the expected output is a list containing F1 through F8.

I have a few questions regarding this regex:

  1. I assume the groups() method returns the fields that were captured in the regex using (...). Why don't I get F1 to F6 in the output, even though they are grouped using (...) and matched the regex?

  2. What is a better regex that excludes the trailing ", " from F1 to F7? (A short explanation of the suggested regex would be much appreciated.)

Upvotes: 0

Views: 162

Answers (2)

Joran Beasley

Reputation: 113988

>>> import re
>>> pat = re.compile(r"""\s+ # one or more whitespace characters
                      (.*?) # the shortest run of anything (captured)
                      \s*   # zero or more whitespace characters
                      [;,]  # a semicolon or a comma
                     """,re.X)
>>> pat.findall("LineNumber(digits): F1, F2, F3, F4, F5, F6, F7, F8;")
['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8']

Upvotes: 1

Steve Shipway

Reputation: 4037

When you have a construct like (pattern){number}, it matches multiple instances, but only the last one is stored. In other words, you get one bucket per (), no matter how many times that group is matched, and the last instance is the one kept. Note that you get a bucket for ALL bracket pairs, even those that take no part in the match, as in something like (a(b)?c)?d matching d; in Python those unused buckets hold None.
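To see this behaviour directly, here is a small sketch that runs the pattern from the question against the sample line from the question, and then tries the (a(b)?c)?d case. The variable names are only illustrative.

```python
import re

line = "21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;"

# Pattern from the question: one group repeated 7 times, then a second group.
matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line)
# Only the 7th repetition of the first group survives, trailing ", " included.
print(matched.groups())  # ('YES, ', 'NO')

# Groups that take no part in the match still get a slot, holding None.
unused = re.match(r'(a(b)?c)?d', 'd')
print(unused.groups())  # (None, None)
```

So the first six fields did match, they were just overwritten each time the group repeated.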

If you know how many items to expect, then you can do your regexp the long way:

\d+: *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *;

This way, since you have 8 sets of brackets, matched.groups() returns a tuple of 8 items. The spaces and commas between the fields sit outside the brackets, so they are not captured.
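If typing the group out 8 times feels error-prone, the same pattern can be assembled in code. This is just a sketch: FIELD and pattern are names I made up, and it assumes exactly 8 fields per line as in the question.

```python
import re

# One capture group for a single field: letters and hyphens only.
FIELD = r'([a-zA-Z-]+)'

# Join 8 copies of the group with " *, *" so every field gets its own bucket.
pattern = re.compile(r'\d+: *' + r' *, *'.join([FIELD] * 8) + r' *;')

matched = pattern.match("21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;")
fields = list(matched.groups())
print(fields)  # ['Yes', 'No', 'Yes', 'No', 'Ye-s', 'N-o', 'YES', 'NO']
```

Lines such as Header or Footer do not match, so pattern.match() returns None for them and you should check for that before calling groups().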

Given that your line is essentially comma-separated values, you may be better off parsing it differently and splitting on commas, rather than trying to have a single regexp match the whole line.
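A rough sketch of that splitting approach follows. The helper name parse_fields is hypothetical, and this version only uses a regexp to recognise a data line; it does not check how many fields there are or which characters they contain.

```python
import re

def parse_fields(line):
    """Return the list of fields, or None if the line is not a data line."""
    # The regexp only recognises "digits: ... ;" and isolates the field list.
    matched = re.match(r'\d+:(.*);\s*$', line)
    if matched is None:
        return None
    # Split on commas and trim the surrounding spaces from each field.
    return [field.strip() for field in matched.group(1).split(',')]

text = """Header
Description
21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;
Footer"""

for line in text.splitlines():
    fields = parse_fields(line)
    if fields is not None:
        print(fields)
```

This works for any number of fields, which is the main advantage over the long regexp if the field count ever changes.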

Upvotes: 0
