Reputation: 826
I'm trying to write a regex in Python to get the F1 to F8 fields from a line that looks like this:
LineNumber(digits): F1, F2, F3, ..., F8;
F1 to F8 can contain lowercase/uppercase letters and hyphens.
For example:
Header
Description
21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;
Footer
What I've tried so far is
matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line)
which matches lines in the above format. However, when I call matched.groups() to print the matched fields, I only get F7 and F8, while the expected output is a list containing F1 to F7 plus F8.
I have a few questions regarding this regex:
I guess the groups() method returns the fields that were grouped in the regex using (...). Why don't I get F1 to F6 in the output, given that they are grouped using (...) and have matched the regex?
What is a better regex I can write to exclude the trailing ", " from F1 to F7? (A short explanation of the suggested regex would be much appreciated.)
Upvotes: 0
Views: 162
Reputation: 113988
>>> import re
>>> pat = re.compile(r"""\s+ # one or more whitespace characters
(.*?) # the shortest possible run of anything (captured)
\s* # zero or more whitespace characters
[;,] # a semicolon or a comma
""", re.X)
>>> pat.findall("LineNumber(digits): F1, F2, F3, F4, F5, F6, F7, F8;")
['F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8']
Upvotes: 1
Reputation: 4037
When you have a construct like (pattern){number}, it matches multiple instances, but only the last one is stored. In other words, you get one bucket per (), no matter how many times that group is matched; the last instance is the one kept. Note that you get a bucket for every pair of parentheses, even ones that took no part in the match (in Python those hold None), as in something like (a(b)?c)?d matching d.
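A small sketch of both points, using the sample line and the pattern from the question. The variable names are only illustrative.

```python
import re

line = "21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;"

# The pattern from the question: one repeated group plus one final group.
matched = re.match(r'\d+: ([a-zA-Z-]*, ){7}(.*);', line)
repeated_groups = matched.groups()
# Group 1 matched seven times, but only its last repetition is kept.
print(repeated_groups)  # ('YES, ', 'NO')

# Both groups are optional and skipped here, yet each still gets a slot.
unused_groups = re.match(r'(a(b)?c)?d', 'd').groups()
print(unused_groups)  # (None, None)
```

So the two values the question sees are F7 (with its trailing comma and space) and F8, which is the behaviour described above.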
If you know how many items to expect, then you can do your regexp the long way:
\d+: *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *, *([a-zA-Z-]+) *;
This way, since you have 8 sets of parentheses, matched.groups() returns 8 items. Also, the spaces and commas between the fields are not captured.
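If typing eight copies by hand feels error-prone, the same long pattern can be assembled from a single field sub-pattern. This is only a sketch: FIELD and LONG_PATTERN are names made up for the illustration, and the field count of 8 is taken from the question.

```python
import re

# One capture group for a single field: letters and hyphens only.
FIELD = r'([a-zA-Z-]+)'

# Eight groups joined by " *, *", which yields the long pattern written out above.
LONG_PATTERN = re.compile(r'\d+: *' + ' *, *'.join([FIELD] * 8) + ' *;')

matched = LONG_PATTERN.match("21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;")
fields = list(matched.groups())
print(fields)  # ['Yes', 'No', 'Yes', 'No', 'Ye-s', 'N-o', 'YES', 'NO']
```

Lines such as Header or Footer do not match, so LONG_PATTERN.match returns None for them and should be checked before calling groups().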
Given that your string is a CSV, you may be better off parsing it differently and splitting on commas rather than trying to have a single regexp to match the whole line.
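A minimal sketch of that splitting approach, assuming it is enough to check the overall "digits: ... ;" shape with one small regex and then split on commas. The helper name split_fields is hypothetical, and unlike the long pattern it does not check the number of fields or their characters.

```python
import re

def split_fields(line):
    """Return the comma-separated fields of a 'digits: F1, ..., F8;' line, or None."""
    # Check the overall shape once, capturing everything between ':' and the final ';'.
    m = re.match(r'\d+:(.*);\s*$', line)
    if m is None:
        return None  # e.g. Header, Description, Footer
    # Split on commas and drop the surrounding spaces from each field.
    return [field.strip() for field in m.group(1).split(',')]

print(split_fields("21: Yes, No, Yes, No, Ye-s, N-o, YES, NO;"))
# ['Yes', 'No', 'Yes', 'No', 'Ye-s', 'N-o', 'YES', 'NO']
```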
Upvotes: 0