pjgat09

Reputation: 133

Why isn't the ? in a regular expression working?

I am trying to parse data that would appear in a course catalog using Perl, but I am struggling to get my regular expression to work correctly.

A few sample lines of data are below:

Course description goes here; There might be more text; 3 hours of lecture, 2 hours of laboratory. Prerequisite: None
Another course description is here; 3 hours of lecture and laboratory. Prerequisite: None
More description; 4 hours of laboratory. Prerequisite: None

I wanted to capture the full description (everything before the final semicolon), then the hours (and later I would handle what hour matches up with lecture or lab). The regular expression I was trying to use was this:

/^(.*)\; *([0-9]).*?(lecture|laboratory).*?([0-9])?.*$/

It seems to work up until ([0-9])?. I thought this would match the second hour number (if there was one), and then the .* after it would match the rest of the line, but this isn't the case. Instead, the final .* matches the second hour and everything after it too.

Why doesn't the ? match the second hour when it is there? Is it a problem with greediness, or did I make a mistake somewhere else?

Upvotes: 0

Views: 121

Answers (3)

Borodin

Reputation: 126742

The problem is that the second .*? always matches the empty string. Because of the ? it is forced to match as few characters as possible, and the optional ([0-9])? allows it to match nothing.

To fix this, change the second .*? so that it matches only non-digit characters, like this:

/^(.*)\; ([0-9]).*?(lecture|laboratory)[^0-9]*([0-9]*)/

Also, changing ([0-9])? to ([0-9]*) will set $4 to an empty string if there is no second hours figure, instead of leaving it undefined.
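As a quick check, here is a minimal sketch that runs the pattern above against the three sample lines from the question. The sub name parse_borodin and the variable names are only illustrative.

```perl
use strict;
use warnings;

# The three sample lines from the question.
my @samples = (
    'Course description goes here; There might be more text; 3 hours of lecture, 2 hours of laboratory. Prerequisite: None',
    'Another course description is here; 3 hours of lecture and laboratory. Prerequisite: None',
    'More description; 4 hours of laboratory. Prerequisite: None',
);

# Hypothetical helper: returns ($1, $2, $3, $4) on a match, or an empty list.
sub parse_borodin {
    my ($line) = @_;
    return $line =~ /^(.*)\; ([0-9]).*?(lecture|laboratory)[^0-9]*([0-9]*)/;
}

for my $line (@samples) {
    my ($desc, $hours, $kind, $second) = parse_borodin($line)
        or next;    # skip lines that do not match
    # $second is '' (not undef) when there is no second hours figure
    printf "%s h %s, second hours: '%s'\n", $hours, $kind, $second;
}
```

Only the first sample has a second hours figure, so $second is '2' there and the empty string for the other two lines.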

Upvotes: 1

mathematical.coffee

Reputation: 56935

It doesn't match the second hour because .*? is non-greedy: it must take the shortest match. Since everything after the (lecture|laboratory) is optional, the shortest possible match is that .*? matches nothing, ([0-9])? also matches nothing, and the .* matches everything.
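To see this happen, here is a small sketch that runs the pattern from the question on its first sample line. The only change is a capture group around the final .* (added purely for illustration) so that what it consumes becomes visible. The sub name parse_original is made up for the example.

```perl
use strict;
use warnings;

# First sample line from the question.
my $sample = 'Course description goes here; There might be more text; '
           . '3 hours of lecture, 2 hours of laboratory. Prerequisite: None';

# The question's pattern, with the trailing .* wrapped in a group
# (an addition for this demo) so we can inspect what it matched.
sub parse_original {
    my ($line) = @_;
    return $line =~ /^(.*)\; *([0-9]).*?(lecture|laboratory).*?([0-9])?(.*)$/;
}

my ($desc, $hours, $kind, $second, $rest) = parse_original($sample);
print "first hours:  $hours ($kind)\n";
print "second hours: ", (defined $second ? $second : 'undef'), "\n";
print "rest of line: $rest\n";
```

The second hours value comes back undefined, and the last group holds ", 2 hours of laboratory. Prerequisite: None", i.e. the digit you wanted ended up in the trailing .*.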

You can change it to be like this:

/^(.*)\; *([0-9]).*?(lecture|laboratory)(.*?([0-9]))?.*$/

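Note that the optional part is now (.*?([0-9]))?, i.e. the .*? after (lecture|laboratory) is paired with a mandatory [0-9]. This means the .*? is only used if there is a second digit to use it with. Because of the extra group, the second hour is now captured in $5 rather than $4.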
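A minimal sketch of that pattern in use, on two of the sample lines from the question. It also shows one possible variation that is not part of the pattern above: making the outer group non-capturing with (?:...) so the digit stays in $4. The sub and variable names are illustrative only.

```perl
use strict;
use warnings;

my $two_hours = 'Course description goes here; There might be more text; '
              . '3 hours of lecture, 2 hours of laboratory. Prerequisite: None';
my $one_hour  = 'More description; 4 hours of laboratory. Prerequisite: None';

# Pattern as given above: the second hour lands in the fifth group.
sub parse_nested {
    my ($line) = @_;
    return $line =~ /^(.*)\; *([0-9]).*?(lecture|laboratory)(.*?([0-9]))?.*$/;
}

# Variation (an assumption, not from the answer): a non-capturing outer
# group, so the second hour is the fourth group instead.
sub parse_nested_nc {
    my ($line) = @_;
    return $line =~ /^(.*)\; *([0-9]).*?(lecture|laboratory)(?:.*?([0-9]))?.*$/;
}

for my $line ($two_hours, $one_hour) {
    my @m = parse_nested($line);
    my @n = parse_nested_nc($line);
    # Groups that did not take part in the match come back as undef.
    printf "capturing: %s, non-capturing: %s\n",
        $m[4] // 'undef', $n[3] // 'undef';
}
```

Both versions give 2 for the first line and undef for the second, since the whole optional group is skipped when no digit follows.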

Upvotes: 1

tripleee

Reputation: 189607

Since the .*? before ([0-9])? is non-greedy, it will match as short a string as possible, which here is the empty string.

It might be better to constrain your matches by specifying what you want to include, i.e. use something like [^;0-9]* instead of .*? to match a sequence which should not include semicolons or numbers.
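For instance, a rough sketch that replaces only the .*? in front of the optional digit with [^;0-9]* and leaves the rest of your pattern alone (the sub name parse_negated is made up for the example):

```perl
use strict;
use warnings;

my $two_hours = 'Course description goes here; There might be more text; '
              . '3 hours of lecture, 2 hours of laboratory. Prerequisite: None';
my $one_hour  = 'More description; 4 hours of laboratory. Prerequisite: None';

# [^;0-9]* is greedy but cannot run past a digit or a semicolon, so it
# stops right in front of the second hours figure if there is one.
sub parse_negated {
    my ($line) = @_;
    return $line =~ /^(.*)\; *([0-9]).*?(lecture|laboratory)[^;0-9]*([0-9])?/;
}

for my $line ($two_hours, $one_hour) {
    my ($desc, $hours, $kind, $second) = parse_negated($line);
    printf "%s h %s, second hours: %s\n", $hours, $kind, $second // 'undef';
}
```

With ([0-9])? kept as it was, the fourth value is still undefined when there is no second figure, so test it with defined before using it.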

Upvotes: 1
