IcedDante

Reputation: 6842

Regular Expression: matching plural cases at end of string

I'm working on a Java regular expression that matches either "es" or "s" at the end of a string and returns the substring without that suffix. It seems easy, but I can't get the 'e' to match with the expressions I'm trying.

Here's the output I should get:

"inches" -> "inch"

"meters" -> "meter"

"ounces" -> "ounc"

but with this regular expression:

Pattern.compile("(.+)(es|s)$", Pattern.CASE_INSENSITIVE);

I'm actually getting:

"inches" -> "inche"

After some research I discovered that the ".+" part of my search is too greedy, and changing it to this:

Pattern.compile("(.+?)(es|s)$", Pattern.CASE_INSENSITIVE);

fixes the problem. My question, though, is why did the 's' match at all? If the 'greedy' nature of the algorithm was the problem, shouldn't it have matched the whole string?

Upvotes: 1

Views: 2611

Answers (2)

Jonah

Reputation: 1545

When it matches greedily, it matches as much as it can while still letting the whole expression match. So the greedy .+ takes everything except the final s, because if it also took the s there would be nothing left for (es|s) to match. When it matches non-greedily, it matches as little as possible while still letting the whole expression match. Therefore, it takes everything except the 'es', because that is the least it can take while the rest of the string still matches (es|s)$.
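To make the difference concrete, here is a minimal Java sketch that runs both patterns from the question over the three sample words. The class name GreedyVsReluctant and the helper stem are made up for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Greedy: .+ grabs as much as it can, then gives characters back one at a time.
    static final Pattern GREEDY =
            Pattern.compile("(.+)(es|s)$", Pattern.CASE_INSENSITIVE);
    // Reluctant: .+? starts with one character and grows only when the rest fails.
    static final Pattern RELUCTANT =
            Pattern.compile("(.+?)(es|s)$", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns group 1 (the stem), or the input if nothing matches.
    static String stem(Pattern pattern, String word) {
        Matcher m = pattern.matcher(word);
        return m.matches() ? m.group(1) : word;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"inches", "meters", "ounces"}) {
            System.out.println(word
                    + " greedy=" + stem(GREEDY, word)
                    + " reluctant=" + stem(RELUCTANT, word));
        }
    }
}
```

For "inches" the greedy pattern should yield "inche" and the reluctant one "inch"; for "meters" both yield "meter", since only the single trailing s can satisfy (es|s) there.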

Upvotes: 4

willeM_ Van Onsem

Reputation: 477533

Short answer

Greedy doesn't mean possessive. A greedy quantifier tries to consume as much as possible, but it gives characters back (backtracks) as soon as holding on to them would prevent the rest of the pattern from matching.
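The distinction can be checked directly, because Java also supports a possessive form of the quantifier, written .++, which never gives characters back. The sketch below is only an illustration; the class name GreedyVsPossessive and the helper fullyMatches are invented for it.

```java
import java.util.regex.Pattern;

public class GreedyVsPossessive {
    // Hypothetical helper: true when the whole word matches the regex.
    static boolean fullyMatches(String regex, String word) {
        return Pattern.compile(regex, Pattern.CASE_INSENSITIVE)
                .matcher(word)
                .matches();
    }

    public static void main(String[] args) {
        // Greedy .+ first takes "inches", then backs off to "inche" so that "s" can match.
        System.out.println(fullyMatches("(.+)(es|s)$", "inches"));
        // Possessive .++ takes "inches" and refuses to back off, so (es|s) has nothing left.
        System.out.println(fullyMatches("(.++)(es|s)$", "inches"));
    }
}
```

The greedy pattern matches and the possessive one does not, which is the "match the whole string" behaviour the question expected from a greedy quantifier.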

Long answer

In regular expressions, quantifiers such as the Kleene star (*) and the plus (+) are greedy by default: they try to take as much as possible, but not more than the rest of the pattern allows. Consider the regex:

(.+)(es|s)$

Here .+ aims to eat as much as possible. But the engine can only reach the end of the regex if it gets past (es|s), which is only possible if .+ leaves at least one s at the end of the string. If we align the regex with your string inches:

(.+)  (es|s)$
inche s

(spaces added). In other words, .+ consumes inche and leaves only the final s for (es|s).

When you make it non-greedy, the .+? tries to give up eating as soon as possible. For the string inches, this is after the inch:

(.+?) (es|s)$
inch  es

It cannot give up earlier, because then the h would somehow have to be matched by (es|s).

Upvotes: 3
