Paul R

Reputation: 2797

Greedy optional character in regex

I have two questions.

  1. How can I make an optional character greedy? I'm trying to write a custom parser and want function arguments to be wrapped in parentheses. For example, sin x becomes sin(x) and cosh^2 x becomes cosh^2(x). My regex:

    import re

    input = 'sinh x'
    output = re.sub(r'(sin|cos|tan|cot|sec|csc)(h?)\s*(|\^\s*[\(]?\s*\-?\s*\d+\s*[\)]?\s*)?([a-z0-9]+)', r'\1\2\3(\4)', input)
    

    This works fine. But when I input sinh(x) (an already well-formed expression), it outputs sin(h)(x). I need to make (h?) greedy, or have the match fail when there is nothing for \4 to match. How can I do that? Note that I can't write ([a-gi-z0-9]), because sinh(h) is valid.

  2. Is there any difference between (h?) and ([h]?)?

Upvotes: 7

Views: 6668

Answers (3)

Mike Hinson

Reputation: 23

This seems to be a pretty robust way to parse the input into (function) (possible ^2) (parameter):

(sinh?|cosh?|tan|cot|sec|csc)[ (]*([\^a-z0-9]*?) *([a-z0-9]+)\)?$

It is perhaps simpler and more concise than using lookahead.
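As a quick check of the pattern above, here is a minimal Python sketch (Python being the language used in the question). The helper name `parse` and the sample inputs are illustrative assumptions, not part of the answer. Note that the trailing `$` means the function call has to sit at the end of the string.

```python
import re

# Pattern copied from the answer above; the trailing $ anchors it to the end of the input.
PATTERN = re.compile(r'(sinh?|cosh?|tan|cot|sec|csc)[ (]*([\^a-z0-9]*?) *([a-z0-9]+)\)?$')


def parse(expr):
    """Return (function, exponent, parameter), or None if nothing matches.

    Hypothetical helper, written only to exercise the pattern.
    """
    m = PATTERN.search(expr)
    return m.groups() if m else None


for expr in ['sinh x', 'sinh(x)', 'sinh(h)', 'cosh^2 x']:
    print(expr, '->', parse(expr))
# sinh x -> ('sinh', '', 'x')
# sinh(x) -> ('sinh', '', 'x')
# sinh(h) -> ('sinh', '', 'h')
# cosh^2 x -> ('cosh', '^2', 'x')
```

Because `sinh?` is tried as a unit and the parameter may be preceded by either a space or `(`, the `h` stays with the function name in all four cases.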

Upvotes: 0

Barmar

Reputation: 780909

  1. Optional characters are already greedy (you would use ?? to make one non-greedy). But greediness only means that the quantifier tries the longest match that still allows the rest of the regular expression to match; it will still backtrack if necessary. If you want to force failure when something specific follows, one way to do that is with a negative lookahead. The explanation above is the main point of this answer, but here's a regexp that uses the technique:

    (sin|cos|tan|cot|sec|csc)(?!.\([^)]*\))(h?)\s*(|\^\s*[\(]?\s*\-?\s*\d+\s*[\)]?\s*)?([a-z0-9]+)
    

DEMO

  2. A character class with a single character in it is identical to putting that character into the RE directly. Quantifiers after it and capture groups around it don't make any difference. Single-character classes are sometimes useful as an alternative to escaping, e.g. [*]? may be easier to read than \*?.
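Both points can be checked with a short Python sketch. The constant names `ORIGINAL`, `GUARDED` and `REPLACEMENT` are made up for this illustration; the patterns themselves are the one from the question and the lookahead variant given above.

```python
import re

# The question's pattern, and the same pattern with the negative lookahead added.
ORIGINAL = r'(sin|cos|tan|cot|sec|csc)(h?)\s*(|\^\s*[\(]?\s*\-?\s*\d+\s*[\)]?\s*)?([a-z0-9]+)'
GUARDED = r'(sin|cos|tan|cot|sec|csc)(?!.\([^)]*\))(h?)\s*(|\^\s*[\(]?\s*\-?\s*\d+\s*[\)]?\s*)?([a-z0-9]+)'
REPLACEMENT = r'\1\2\3(\4)'

# (h?) is greedy: it takes the "h" when the rest of the pattern can still match.
print(re.match(ORIGINAL, 'sinh x').group(2))         # h

# It gives the "h" back when that is the only way to get an overall match.
print(repr(re.match(ORIGINAL, 'sinh(x)').group(2)))  # ''
print(re.sub(ORIGINAL, REPLACEMENT, 'sinh(x)'))      # sin(h)(x)

# The lookahead rejects the match outright, so well-formed input is left alone.
print(re.sub(GUARDED, REPLACEMENT, 'sinh(x)'))       # sinh(x)
print(re.sub(GUARDED, REPLACEMENT, 'sinh x'))        # sinh(x)

# (h?) and ([h]?) capture the same text on the same input.
for text in ['sinh', 'sin']:
    assert re.match(r'sin(h?)', text).group(1) == re.match(r'sin([h]?)', text).group(1)

# [*]? and \*? are two spellings of "an optional literal asterisk".
assert re.fullmatch(r'a[*]?', 'a*') and re.fullmatch(r'a\*?', 'a*')
```

The second and third prints show the backtracking that caused the original problem: the engine prefers a match with an empty `h?` over no match at all.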

Upvotes: 2

Sergey Kalinichenko

Reputation: 726559

Rather than making the optional h greedy, consider disambiguating your grammar by requiring that the function parameter be preceded by a space or an opening parenthesis:

  ((?<=\s|\()[a-z0-9]+)
// ^^^^^^^^^^^^

This lookbehind ensures that 'h' (or any other letter, for that matter) that follows the name of the function without spaces is not treated as a function parameter.

I would change the overall expression as follows:

((?:sin|cos|tan|cot|sec|csc)(?:h)?)\s*(?:[\^](\d+)\s*)?(?:((?<=\s|\()[a-z0-9]+)|[(]((?<=\s|\()[a-z0-9]+)[)])

to add an optional numeric exponent after ^, and to make sure that the parentheses are balanced (i.e. both parentheses are present, or both parentheses are missing).

Demo (using Java regex engine).
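The expression should also work in Python's `re` module, since both lookbehind alternatives are exactly one character wide. Below is a minimal sketch under that assumption; the `rewrite` helper and the way it recombines the four capture groups are illustrative choices, not something the answer prescribes.

```python
import re

# The overall expression from above, split across two raw strings for readability.
PATTERN = re.compile(
    r'((?:sin|cos|tan|cot|sec|csc)(?:h)?)\s*(?:[\^](\d+)\s*)?'
    r'(?:((?<=\s|\()[a-z0-9]+)|[(]((?<=\s|\()[a-z0-9]+)[)])'
)


def rewrite(expr):
    """Wrap every function parameter in parentheses (hypothetical helper)."""
    def build(m):
        name, exponent, bare_arg, wrapped_arg = m.groups()
        # Exactly one of the two parameter groups takes part in any match.
        arg = bare_arg if bare_arg is not None else wrapped_arg
        power = f'^{exponent}' if exponent else ''
        return f'{name}{power}({arg})'
    return PATTERN.sub(build, expr)


for expr in ['sinh x', 'sinh(x)', 'sinh(h)', 'cosh^2 x']:
    print(expr, '->', rewrite(expr))
# sinh x -> sinh(x)
# sinh(x) -> sinh(x)
# sinh(h) -> sinh(h)
# cosh^2 x -> cosh^2(x)
```

A replacement function is used instead of a template such as `\1(\3)` because the parameter lands in group 3 or group 4 depending on whether the input already had parentheses.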

Upvotes: 1
