Reputation: 133
I am trying to parse data that would appear in a course catalog using Perl, but I am struggling to get my regular expression to work correctly.
A few sample lines of data are below:
Course description goes here; There might be more text; 3 hours of lecture, 2 hours of laboratory. Prerequisite: None
Another course description is here; 3 hours of lecture and laboratory. Prerequisite: None
More description; 4 hours of laboratory. Prerequisite: None
I wanted to capture the full description (everything before the final semicolon), then the hours (and later I would handle what hour matches up with lecture or lab). The regular expression I was trying to use was this:
/^(.*)\; *([0-9]).*?(lecture|laboratory).*?([0-9])?.*$/
It seems to work up until the ([0-9])? part. I thought this would match the second hour number (if there was one), and then the .* after it would match the rest of the line, but this isn't the case. Instead, the final .* matches the second hour and everything after it too.
Why doesn't the ? let the group match the second hour when it is there? Is it a problem with greediness, or did I make a mistake in some other way?
Upvotes: 0
Views: 121
Reputation: 126742
The problem is that the second .*? always matches the empty string. Because of the ? it is forced to match as few characters as possible, and the optional ([0-9])? allows it to match nothing.
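The behaviour described above can be reproduced with a small script. This is only a sketch: the helper name parse_original is made up for illustration, and the pattern is the one from the question, unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the question's original pattern and
# returns the four capture groups (list context).
sub parse_original {
    my ($line) = @_;
    return $line =~ /^(.*)\; *([0-9]).*?(lecture|laboratory).*?([0-9])?.*$/;
}

my @captures = parse_original(
    'Course description goes here; There might be more text; '
  . '3 hours of lecture, 2 hours of laboratory. Prerequisite: None'
);

# The lazy .*? consumes nothing and ([0-9])? is skipped, so the fourth
# capture stays undefined even though a second hour is present.
printf "hours=%s kind=%s second=%s\n",
    $captures[1], $captures[2],
    defined $captures[3] ? $captures[3] : '(undef)';
# Prints: hours=3 kind=lecture second=(undef)
```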
To fix this, change .*? to match just non-numeric characters, like this:
/^(.*)\; ([0-9]).*?(lecture|laboratory)[^0-9]*([0-9]*)/
Also, changing ([0-9])? to ([0-9]*) will set $4 to an empty string if there is no second hours figure, instead of leaving it undefined.
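As a rough check, the pattern above can be run against the three sample lines from the question. The helper name parse_hours is an assumption made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern suggested in this answer.
sub parse_hours {
    my ($line) = @_;
    return $line =~ /^(.*)\; ([0-9]).*?(lecture|laboratory)[^0-9]*([0-9]*)/;
}

my @samples = (
    'Course description goes here; There might be more text; 3 hours of lecture, 2 hours of laboratory. Prerequisite: None',
    'Another course description is here; 3 hours of lecture and laboratory. Prerequisite: None',
    'More description; 4 hours of laboratory. Prerequisite: None',
);

for my $sample (@samples) {
    # Skip the description capture; keep the hours and the kind.
    my (undef, $first, $kind, $second) = parse_hours($sample);
    printf "%s %s [%s]\n", $first, $kind, $second;
}
# Prints:
# 3 lecture [2]
# 3 lecture []
# 4 laboratory []
```

One thing to watch: [^0-9]* runs on to the next digit anywhere in the rest of the line, so a prerequisite such as "CS 101" would end up in $4. The sample data only ever says "None", so it does not show up here.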
Upvotes: 1
Reputation: 56935
It doesn't match the second hour because .*? is non-greedy: it must take the shortest match. Since everything after the (lecture|laboratory) is optional, the shortest possible match is that .*? matches nothing, ([0-9])? also matches nothing, and the .* matches everything.
You can change it to be like this:
/^(.*)\; *([0-9]).*?(lecture|laboratory)(.*?([0-9]))?.*$/
Note that the optional part is now (.*?([0-9]))?, i.e. the first .*? is paired with a mandatory [0-9]. This means the .*? is only used if there is a second digit to use it with.
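With the pattern exactly as written, the extra pair of parentheses is itself a capture group, so the second hour lands in $5 rather than $4. The sketch below swaps in a non-capturing (?:...) group to keep it in $4; that substitution and the helper name parse_hours_grouped are not part of the answer, just one way to try it out.

```perl
use strict;
use warnings;

# Hypothetical helper: same idea as the pattern above, but the optional
# part uses (?:...) so the second hour stays in the fourth capture.
sub parse_hours_grouped {
    my ($line) = @_;
    return $line =~ /^(.*)\; *([0-9]).*?(lecture|laboratory)(?:.*?([0-9]))?.*$/;
}

for my $sample (
    'Course description goes here; There might be more text; 3 hours of lecture, 2 hours of laboratory. Prerequisite: None',
    'More description; 4 hours of laboratory. Prerequisite: None',
) {
    my (undef, $first, $kind, $second) = parse_hours_grouped($sample);
    printf "%s %s %s\n", $first, $kind,
        defined $second ? $second : '(undef)';
}
# Prints:
# 3 lecture 2
# 4 laboratory (undef)
```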
Upvotes: 1
Reputation: 189607
Since the regex before [0-9] is non-greedy, it will match as short a string as possible.
It might be better to constrain your matches by specifying what you want to include, i.e. use something like [^;0-9]* instead of .*? to match a sequence which should not include semicolons or numbers.
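The answer does not spell out a full pattern, so the following is only one possible reading of the suggestion: both wildcard stretches after the first hour are replaced with the [^;0-9] class. The complete pattern and the helper name parse_hours_constrained are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the gaps around (lecture|laboratory) may not
# contain semicolons or digits, so they cannot skip past a second hour.
sub parse_hours_constrained {
    my ($line) = @_;
    return $line =~ /^(.*); *([0-9])[^;0-9]*?(lecture|laboratory)[^;0-9]*([0-9])?/;
}

my @captures = parse_hours_constrained(
    'Course description goes here; There might be more text; '
  . '3 hours of lecture, 2 hours of laboratory. Prerequisite: None'
);
printf "%s %s %s\n", @captures[1 .. 3];
# Prints: 3 lecture 2
```

Because the class after (lecture|laboratory) is greedy and stops only at a digit or semicolon, the optional ([0-9])? gets a chance to match when a second figure exists and is left undefined otherwise.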
Upvotes: 1