user2609605

Reputation: 628

No maximum match although greedy optional-operator

I have a concrete pattern:

\A
(%\s*!\s*T[eE]X program=(?<programMagic>[^} ]+)\R)?
(\\(documentstyle|documentclass)\s*(\[[^]]*\])?\s*\{(?<docClass>[^} ]+)\})?

and I want to match the following string:

% !TEX program=lualatex
\documentclass[a4paper,12pt]{book}

I match with the MULTILINE flag.

As one can see, this is about matching the beginning of a LaTeX root file and extracting the program and the document class.

The \A restricts the match to the start of the file, and the last two lines of the pattern are optional, yet greedy (operator ?). This is not the whole truth, but I want to identify root files when either a class or a program is given.

If I remove the two ?, the given file is matched and both program and class are identified correctly and show up in their respective named groups.

If I add both ? as above, then only the second part, the class, is matched, or at least only the class shows up in its group. The other group has the value null, indicating no match. If I use only one ?, then only the other, non-optional part is matched; at least that is what shows up in the groups.

If I understand the theory of regex correctly, ? is greedy, so both class and program must be matched by the version of the pattern above.

This is not the case. Is this a bug or do I misunderstand something?

Reading my own question again: since the first line of my string matches the second line of my pattern, and the second line of my string matches the third line of my pattern, I think it is more likely that the matching is fine but the matched texts don't show up in the correct groups. This would indicate a bug in Java's regex engine. Any other ideas?

Upvotes: 0

Views: 102

Answers (2)

rzwitserloot

Reputation: 103707

(?&lt;programMagic&gt;[^} ]+)\R)

This is... bizarre. I assume something got lost in copying and pasting, and what you meant to put there is (?<programMagic>[^} ]+).

Applying that minor fix, I cannot reproduce your problem. Possibly your attempt to translate this stuff to Java failed. Double escaping those backslashes can be a right pain. Here it is:

import java.util.regex.Pattern;

Pattern p = Pattern.compile(
"\\A" +
"(%\\s*!\\s*T[eE]X program=(?<programMagic>[^} ]+)\\R)?" +
"(\\\\(documentstyle|documentclass)\\s*(\\[[^]]*\\])?\\s*\\{(?<docClass>[^} ]+)\\})?");

String txt = "% !TEX program=lualatex\n" +
"\\documentclass[a4paper,12pt]{book}";

var m = p.matcher(txt);
System.out.println(m.matches());
System.out.println("G1: '" + m.group("programMagic") + "'");
System.out.println("G2: '" + m.group("docClass") + "'");

m = p.matcher("% !TEX program=lualatex\n");
System.out.println(m.matches());
System.out.println("G1: '" + m.group("programMagic") + "'");
System.out.println("G2: '" + m.group("docClass") + "'");

m = p.matcher("\\documentclass[a4paper,12pt]{book}");
System.out.println(m.matches());
System.out.println("G1: '" + m.group("programMagic") + "'");
System.out.println("G2: '" + m.group("docClass") + "'");

As far as I can tell, this does exactly what you wanted:

true
G1: 'lualatex'
G2: 'book'
true
G1: 'lualatex'
G2: 'null'
true
G1: 'null'
G2: 'book'
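
To see the greedy behaviour in isolation, here is a minimal sketch with a made-up, reduced pattern of the same shape (two optional groups after \A). The class name, the helper groups and the group names first and second are only for illustration and do not come from the question. It also shows that a group which did not take part in the match is reported as null, and that a pattern consisting only of optional parts matches any input with an empty match.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyOptionalDemo {
    // Hypothetical reduced pattern: an optional "a" line followed by an optional "b" run.
    static final Pattern P =
        Pattern.compile("\\A(?:(?<first>a+)\\R)?(?<second>b+)?");

    /** Returns {first, second}; a null entry means that group did not participate. */
    static String[] groups(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group("first"), m.group("second") };
    }

    public static void main(String[] args) {
        // Greedy ?: both optional parts are taken when both can match.
        System.out.println(Arrays.toString(groups("aa\nbb"))); // [aa, bb]
        // Only one part present: the other group is null.
        System.out.println(Arrays.toString(groups("bb")));     // [null, bb]
        System.out.println(Arrays.toString(groups("aa\n")));   // [aa, null]
        // Neither part present: find() still succeeds with an empty match.
        System.out.println(Arrays.toString(groups("xx")));     // [null, null]
    }
}
```

The last case is worth keeping in mind: with both parts optional, a successful match alone does not tell you that the file is a root file, so you have to check the groups for null.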

Upvotes: 1

anubhava

Reputation: 785971

I want to identify root files when either a class or a program is given.

You may use this regex with alternation:

\A(?:%\s*!\s*T[eE]X program=(?<programMagic>[^} ]+)\R|(?:.*\R)?\\(documentstyle|documentclass)\s*\[[^]]*]\s*\{(?<docClass>[^} ]+)})

Note that pretty much all of your regex pattern is used here but we are using it like:

\A(?:patternLine1\R|(?:.*\R)?patternLine2)

RegEx Demo
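
In Java, the alternation could be used roughly like this. This is only a sketch: the class name and the classify helper are hypothetical, and the regex is the one shown above with the backslashes doubled for a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RootFileAlternationDemo {
    // Alternation regex from this answer, escaped for a Java string literal.
    static final Pattern ROOT = Pattern.compile(
        "\\A(?:%\\s*!\\s*T[eE]X program=(?<programMagic>[^} ]+)\\R"
        + "|(?:.*\\R)?\\\\(documentstyle|documentclass)\\s*\\[[^]]*]\\s*\\{(?<docClass>[^} ]+)})");

    /** Hypothetical helper: "program=..." or "class=..." for a root file, else null. */
    static String classify(String fileStart) {
        Matcher m = ROOT.matcher(fileStart);
        if (!m.find()) {
            return null; // neither alternative matched at the start of the input
        }
        String program = m.group("programMagic");
        // Exactly one of the two named groups participates in a match.
        return program != null ? "program=" + program : "class=" + m.group("docClass");
    }

    public static void main(String[] args) {
        System.out.println(classify(
            "% !TEX program=lualatex\n\\documentclass[a4paper,12pt]{book}")); // program=lualatex
        System.out.println(classify("\\documentclass[a4paper,12pt]{book}"));  // class=book
        System.out.println(classify("plain text"));                           // null
    }
}
```

Because the first alternative that succeeds wins, a file containing both lines only fills programMagic, and docClass stays null. Note also that, as written, the option block [...] is mandatory in this variant, so \documentclass{book} without options would not match.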

Upvotes: 1
