TrantSteel

Reputation: 245

Regex optional non-capturing groups

I am a total regex noob and have spent hours trying to solve this puzzle. I think I need some kind of optional non-capturing groups or alternation.

I want to match the following strings:

  1. Neuer Film a von 1000

  2. Neuer Film a von 1000 mit b

  3. Neuer Film a von 1000 mit b und c

  4. Neuer Film a von 1000 mit b und c und d

  5. Neuer Film a mit b

  6. Neuer Film a mit b und c

  7. Neuer Film a mit b und c und d

My regex looks like this:

var regex = /(?:[nN]euer [Ff]ilm\s?)(.*)(?:[vV]on).(\d{4}).(?:[Mm]it)(.*)(?:[uU]nd)(.*)/g;

The problem is that it only matches strings 3 and 4. And when there are two "und"s, it does not match them separately but packs the text into group 3 instead of group 4.

Can someone please help with my regex (which is not very user-friendly at all ;)?

Upvotes: 19

Views: 18805

Answers (1)

Wiktor Stribiżew

Reputation: 626738

You do need optional non-capturing groups (like (?:...)?), but you also need anchors (^ to match the start of the string and $ to match the end of the string) and lazy dot matching (.*?, which matches as few characters as possible).
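To see why the lazy quantifier and the $ anchor both matter, here is a minimal sketch using a deliberately simplified pattern (only the "von" part, not the full regex). The variable names are illustrative only.

```javascript
// Simplified pattern: only the optional "von <year>" part is kept.
const input = 'Neuer Film a von 1000';

// Greedy (.*) consumes the rest of the line, so the optional group matches nothing.
const greedy = /^Neuer Film\s*(.*)(?:\s*von\s+(\d{4}))?$/.exec(input);

// Lazy (.*?) grows one character at a time, giving the optional group a chance to match.
const lazy = /^Neuer Film\s*(.*?)(?:\s*von\s+(\d{4}))?$/.exec(input);

// Lazy without $: nothing forces (.*?) to grow, so it stays empty.
const unanchored = /^Neuer Film\s*(.*?)(?:\s*von\s+(\d{4}))?/.exec(input);

console.log(greedy[1], greedy[2]);         // "a von 1000" undefined
console.log(lazy[1], lazy[2]);             // "a" "1000"
console.log(unanchored[1], unanchored[2]); // "" undefined
```

The lazy group only expands because the trailing $ rejects every shorter match, which is why the two are needed together.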

You may use

/^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/

See the regex demo. In the demo, /gm modifiers are necessary since the input is a multiline string.
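As a rough sketch of the multiline case, the same pattern can be applied with the g and m flags to a string that holds several lines at once. The two sample lines and the use of String.prototype.matchAll are choices made for this illustration; the demo itself may be set up differently.

```javascript
// Two of the sample strings joined into one multiline input.
const text = [
  'Neuer Film a von 1000 mit b und c',
  'Neuer Film a mit b',
].join('\n');

// With m, ^ and $ match at line boundaries; with g, every line is found.
const rx = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/gm;

// Collect capture groups 1 to 4 for each matched line.
const rows = [...text.matchAll(rx)].map(m => m.slice(1));

console.log(rows);
// [ [ 'a', '1000', 'b', 'c' ], [ 'a', undefined, 'b', undefined ] ]
```

Without the m flag, ^ and $ would only match at the very start and end of the whole string, so the pattern would not match each line separately.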

Pattern details:

  • ^ - start of a string anchor
  • [nN]euer [Ff]ilm - Neuer Film / Neuer film / neuer Film / neuer film
  • \s* - zero or more whitespaces
  • (.*?) - Group 1: any 0+ chars other than line break chars, as few as possible (that is, up to the leftmost occurrence of the subsequent subpatterns)
  • (?:\s*[vV]on\s+(\d{4}))? - 1 or 0 occurrences of:
    • \s* - 0+ whitespaces
    • [vV]on - von or Von
    • \s+ - 1+ whitespaces
    • (\d{4}) - Group 2: 4 digits
  • (?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)? - an optional non-capturing group matching 1 or 0 occurrences of:
    • \s+ - 1+ whitespaces
    • [Mm]it - Mit or mit
    • \s* - 0+ whitespaces
    • (.*?) - Group 3 matching any 0+ chars other than line break chars, as few as possible
    • (?:\s*[uU]nd\s*(.*))? - an optional non-capturing group matching
      • \s*[uU]nd\s* - und or Und enclosed with 0+ whitespaces
      • (.*) - Group 4 matching any 0+ chars other than line break chars, as many as possible
  • $ - end of string.

// The seven sample strings from the question
var strs = [
  'Neuer Film a von 1000',
  'Neuer Film a von 1000 mit b',
  'Neuer Film a von 1000 mit b und c',
  'Neuer Film a von 1000 mit b und c und d',
  'Neuer Film a mit b',
  'Neuer Film a mit b und c',
  'Neuer Film a mit b und c und d'
];
var rx = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/;
for (var s of strs) {
  var m = rx.exec(s);
  if (m) {
    console.log('--- ' + s + ' ---');
    console.log('Group 1: ' + m[1]);
    // Groups 2-4 are undefined when their optional part did not take part in the match
    if (m[2]) console.log('Group 2: ' + m[2]);
    if (m[3]) console.log('Group 3: ' + m[3]);
    if (m[4]) console.log('Group 4: ' + m[4]);
  }
}
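Note that for a string with two "und"s, Group 4 holds everything after the first one (for example "c und d"). If each name is needed separately, one possible follow-up step is to split that group afterwards. The sketch below assumes names are separated by "und" surrounded by whitespace; parseFilm and the shape of the returned object are hypothetical, not part of the pattern above.

```javascript
const rx = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/;

// Hypothetical helper: returns null when the line does not match.
function parseFilm(line) {
  const m = rx.exec(line);
  if (!m) return null;
  const [, title, year, first, rest] = m;
  const cast = [];
  if (first) cast.push(first);
  // Group 4 may still contain further "und"s, so split it (assumed separator).
  if (rest) cast.push(...rest.split(/\s+[uU]nd\s+/));
  return { title, year: year ?? null, cast };
}

console.log(parseFilm('Neuer Film a von 1000 mit b und c und d'));
// { title: 'a', year: '1000', cast: [ 'b', 'c', 'd' ] }
console.log(parseFilm('Neuer Film a'));
// { title: 'a', year: null, cast: [] }
```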

Upvotes: 20
