wiherek
wiherek

Reputation: 1923

Capturing groups regex with global flag in JavaScript

I have the use case, where I need to allow for processing an arbitrary array of strings, by an arbitrary regex, which is created either by a regex literal, or through the new RegExp() constructor.

It all works fine, until the global g flag is used with capturing groups.

I read a few answers on SO, and the proposed solution is to use regex.exec(string) in a while loop, e.g. How do you access the matched groups in a JavaScript regular expression?, JavaScript regular expressions and sub-matches

I also talked about it on IRC, and was advised against the implementation all together:

but there's regexes that will segfault your engine, unless you're using spidermonkey.

So here is a corner case, try pasting it into a fiddle or plunker, or even the console, it just breaks:

var regexString = '([^-]*)';
var flags = 'ig';
var regex = new RegExp(regexString, flags);

var arr = ['some-property-image.png', 'another-prop-video.png', 'y-no-work.bmp'];
var result = [];

arr.forEach(function(item) {
  var match;
  var inter = [];
  while (match = regex.exec(item)) {
    inter.push(match[0]);
  }
});

console.log(result);

I did tried it on regex101.com https://regex101.com/r/xG0cL4/1 It breaks even if I do it without the quantifier, i.e. /([^-])/g https://regex101.com/r/yT7sQ2/1

My question: what is the (correct|safe) way to process arbitrary regexes against arbitrary strings?

Upvotes: 1

Views: 2155

Answers (3)

James Wilkins
James Wilkins

Reputation: 7378

It's not working because when '-' is reached, exec() fails to match the '-' character, but does match 0 characters (because of *), so it doesn't skip it, and thus it gets stuck. If you use -|([^-]*), then it will skip the '-' character. You will need to then check the 'match.index' property to see if you've reach the end.

Also, you should be adding match[1] not match[0] if your intent is to save the matched text.

This works:

var regexString = '-|([^-]*)'; // or better yet: '([^-]+)' will work also
var flags = 'ig';
var regex = new RegExp(regexString, flags);

var arr = ['some-property-image.png', 'another-prop-video.png', 'y-no-work.bmp'];
var result = [];

arr.forEach(function(item) {
  var match;
  var inter = [];      
  while (match = regex.exec(item)) {
    if (match.index >= item.length) break;
    else if (match[1] !== void 0) inter.push(match[1]);
  }
});

console.log(result);

but why not use 'match()' instead?

var regexString = '[^-]+';
var flags = 'gi';
var regex = new RegExp(regexString, flags);

var arr = ['some-property-image.png', 'another-prop-video.png', 'y-no-work.bmp'];
var result = [];

arr.forEach(function(item) {
  var inter = item.match(regex);
});

console.log(result);

Upvotes: 1

Jason Cust
Jason Cust

Reputation: 10909

Why it's crashing: an infinite loop was created by using a RegExp object with a global flag and exec in a loop with a zero-width match. The loop hits the first '-' but doesn't match that character because of the negated character class. It then reverts back to the zero-length match and so doesn't advance the index value for exec. This means on the next loop it starts back at the same spot and does the exact same thing...infinitely.

That said, it's hard to tell what exactly you want but why not try match? It looks like you only care about the matching string so exec seems to be a bit of overkill.

If the desired output is a one-to-one array of results with the input array:

function foo(regexp, strings) {
  return strings.reduce(function(matches, str) {
    matches.push(str.match(regexp));
    return matches;
  }, []);
}

foo(/([^-]+)/ig, arr);
// outputs: [["some","property","image.png"],["another","prop","video.png"],["y","no","work.bmp"]]

foo(new RegExp('([^-]+)', 'ig'), arr);
// outputs: [["some","property","image.png"],["another","prop","video.png"],["y","no","work.bmp"]]

Even with the zero-width matching it won't go into an infinite loop:

foo(/([^-]*)/ig, arr));
// outputs: [["some","","property","","image.png",""],["another","","prop","","video.png",""],["y","","no","","work.bmp",""]]

If the desired output really is one array of all the matches:

function foo(regexp, strings) {
  return strings.reduce(function(matches, str) {
    return matches.concat(str.match(regexp));
  }, []);
}

foo(/([^-]+)/ig, arr);
// outputs: ["some","property","image.png","another","prop","video.png","y","no","work.bmp"]

foo(new RegExp('([^-]+)', 'ig'), arr);
// outputs:  ["some","property","image.png","another","prop","video.png","y","no","work.bmp"]

foo(/([^-]*)/ig, arr));
// outputs: ["some","","property","","image.png","","another","","prop","","video.png","","y","","no","","work.bmp",""]

Upvotes: 0

Barmar
Barmar

Reputation: 782407

The match object has an index property that contains the position of the current match. If that stays the same between two calls in the loop, it means you're stuck.

var regexString = '([^-]*)';
var flags = 'ig';
var regex = new RegExp(regexString, flags);

var arr = ['some-property-image.png', 'another-prop-video.png', 'y-no-work.bmp'];
var result = [];

arr.forEach(function(item) {
  var match;
  var inter = [];
  var lastIndex = -1;
  while (match = regex.exec(item)) {
    if (match.index == lastIndex) {
      break;
    }
    lastIndex = match.index;
    inter.push(match[0]);
  }
  result.push(inter);
});

console.log(result);

Upvotes: 0

Related Questions