Paulo Buchsbaum

Reputation: 2659

How to overcome the lack of Perl's \G in JavaScript code?

In Perl, when one wants to parse a string continuously, it can be done like this:

my $string = " a 1 # ";

while () {
    if ( $string =~ /\G\s+/gc )    {
        print "whitespace\n";
    }
    elsif ( $string =~ /\G[0-9]+/gc ) {   # /c keeps pos() when a match fails
        print "integer\n";
    }
    elsif ( $string =~ /\G\w+/gc ) {
        print "word\n";
    }
    else {
        print "done\n";
        last;
    }
}

Source: When is \G useful application in a regex?

It produces the following output:

whitespace
word
whitespace
integer
whitespace
done

In JavaScript (and many other regular expression flavors) there is no \G assertion, nor any good replacement for it.

So I came up with a very simple solution that serves my purpose.

<!-- language: lang-js --> 
//*************************************************
// pattmatch - Matches the pattern PAT against ST starting at POS.
// Note the use of "^" in PAT to simulate the "\G" assertion.
//*************************************************
function pattmatch(st, pat, pos)
{
    var resu;
    pat.lastIndex = 0;
    if (pos === 0)
        return pat.exec(st);               // match at the start of the string
    else {
        resu = pat.exec(st.slice(pos));    // match at the start of the remainder
        if (resu)
            pat.lastIndex = pat.lastIndex + pos;   // convert back to an offset in st
        return resu;
    }
}

So, the above example would look like this in JavaScript (node.js):

<!-- language: lang-js -->
var string = " a 1 # ";
var pos=0, ret;  
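// The alternatives are grouped so that "^" anchors all of them, not just the first.
// The "m" flag is omitted so that "^" only matches at the start of the slice.
var getLexema  = new RegExp("^(?:(\\s+)|([0-9]+)|(\\w+))","gi");
while (pos<string.length && ( ret = pattmatch(string,getLexema,pos)) ) {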
    if (ret[1]) console.log("whitespace");
    if (ret[2]) console.log("integer");
    if (ret[3]) console.log("word");
    pos = getLexema.lastIndex;
}  // While
console.log("done");

It produces the same output as the Perl code snippet:

whitespace
word
whitespace
integer
whitespace
done

Notice that the parser stops at the # character. Parsing can then continue in another code snippet, starting from position pos.

Is there a better way in JavaScript to simulate Perl's \G regex assertion?

Edit

Out of curiosity, I decided to compare my own solution with @georg's proposal. I am not claiming that either one is better. For me, it's a matter of taste.

Will my system, which depends heavily on user interaction, become slow?

@ikegami writes about @georg's solution:

... his solution adds is a reduction in the number of times your input file is copied ...

So I decided to compare both solutions in a loop that repeats the code 10 million times:

<!-- language: lang-js -->
var i;
var n1,n2;
var string,pos,m,conta,re;

// My code
conta=0;
n1 = Date.now();
for (i=0;i<10000000;i++) {
  string = " a 1 # ";
  pos=0, m;  
  re  = new RegExp("^(\\s+)|([0-9]+)|(\\w+)","gim");  
  while (pos<string.length && ( m = pattMatch(string,re,pos)) ) {
    if (m[1]) conta++;
    if (m[2]) conta++;
    if (m[3]) conta++;
    pos = re.lastIndex;
  }  // While
}
n2 = Date.now();
console.log('Mine: ' , ((n2-n1)/1000).toFixed(2), ' segundos' );


// Other code
conta=0;
n1 = Date.now();

for (i=0;i<10000000;i++) {
  string = " a 1 # ";
  re  = /^(?:(\s+)|([0-9]+)|(\w+))/i;
  while (m = string.match(re)) {
   if (m[1]) conta++;
   if (m[2]) conta++;
   if (m[3]) conta++;
   string = string.slice(m[0].length)
 }
 }
n2 = Date.now();
console.log('Other: ' , ((n2-n1)/1000).toFixed(2) , ' segundos');

//*************************************************
// pattMatch - Matches the pattern PAT against ST starting at POS.
// Note the use of "^" in PAT to simulate the "\G" assertion.
//*************************************************
function pattMatch(st,pat,pos)
{
var resu;
pat.lastIndex=0;
if (pos===0)  
    return  pat.exec(st);    
else  {
  resu = pat.exec(st.slice(pos)); 
  if (resu) 
      pat.lastIndex = pat.lastIndex + pos;
  return resu;
}  
} // pattMatch

Results:

Mine: 11.90 segundos
Other: 10.77 segundos

My code runs about 10% longer. It spends about 110 nanoseconds more per iteration.
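
Both figures follow directly from the two timings above. A quick check, using only the numbers reported in the results (the variable names are illustrative):

```javascript
// Timings copied from the benchmark results, in seconds.
const mine = 11.90;
const other = 10.77;
const iterations = 1e7; // 10 million repetitions of the outer loop

const extraSeconds = mine - other;
const percentSlower = (extraSeconds / other) * 100;
const extraNsPerIteration = (extraSeconds / iterations) * 1e9;

console.log(percentSlower.toFixed(1) + "%");           // 10.5%
console.log(Math.round(extraNsPerIteration) + " ns");  // 113 ns
```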

Honestly, I consider this loss of efficiency acceptable in a system with heavy user interaction.

If my project involved heavy mathematical processing with multidimensional arrays or gigantic neural networks, I might reconsider.

Upvotes: 7

Views: 308

Answers (2)

georg

Reputation: 214959

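Looks like you're overcomplicating it a bit. exec with the g flag resumes at lastIndex, so as long as the pattern can match at every position (note the catch-all [\s\S] group), each match starts exactly where the previous one ended: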

var 
    string = " a 1 # ",
    re  = /(\s+)|([0-9]+)|(\w+)|([\s\S])/gi,
    m;

while (m = re.exec(string)) {
    if (m[1]) console.log('space');
    if (m[2]) console.log('int');
    if (m[3]) console.log('word');
    if (m[4]) console.log('unknown');    
}
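
The catch-all group is what keeps the matches contiguous. As a small sketch of what happens without it (the `gaps` bookkeeping is only for illustration and is not part of the loop above), exec with the g flag silently skips characters that no alternative matches:

```javascript
const input = " a 1 # ";
const re = /(\s+)|([0-9]+)|(\w+)/g; // same alternatives, but no catch-all group
const gaps = [];                     // text that exec skipped over
let expected = 0;                    // index where the next token should start
let m;

while ((m = re.exec(input))) {
  // If the match begins later than expected, something was skipped.
  if (m.index !== expected) gaps.push(input.slice(expected, m.index));
  expected = re.lastIndex;
}

console.log(gaps); // [ '#' ]
```

Comparing m.index with the expected position is one way to detect the skip if a catch-all group is not wanted.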

If your regexp does not cover every possible input and you want to stop at the first non-match, the simplest way is to anchor the match with ^ and strip the matched prefix from the string each time:

    var 
        string = " a 1 # ",
        re  = /^(?:(\s+)|([0-9]+)|(\w+))/i,
        m;

    while (m = string.match(re)) {
        if (m[1]) console.log('space');
        if (m[2]) console.log('int');
        if (m[3]) console.log('word');
        string = string.slice(m[0].length)
    }

    console.log('done, rest=[%s]', string)

This simple method doesn't fully replace \G (or your "match from" method), because it loses the left context of the match.
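
One way to see what losing the left context means is an assertion that looks backward, such as \b. The following is only a sketch with a made-up two-character input: it compares matching on a slice with matching on the full string from the same position.

```javascript
const text = "ab";
const pos = 1; // resume in the middle of the word "ab"

// Slice-and-anchor: the slice "b" has no memory of the "a" before it,
// so \b (word boundary) succeeds at its start.
const sliced = /^\b\w/.test(text.slice(pos));

// Full string: with the g flag, exec starts searching at lastIndex, and the
// match only counts here if it begins exactly at pos.
const re = /\b\w/g;
re.lastIndex = pos;
const m = re.exec(text);
const inContext = m !== null && m.index === pos;

console.log(sliced);    // true
console.log(inContext); // false
```

In the full string there is no word boundary between "a" and "b", so the two approaches disagree.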

Upvotes: 2

ikegami

Reputation: 385847

The functionality of \G exists in the form of the /y (sticky) flag.

var regex = /^foo/y;
regex.lastIndex = 2;
regex.test('..foo');   // false - index 2 is not the beginning of the string

var regex2 = /^foo/my;
regex2.lastIndex = 2;
regex2.test('..foo');  // false - index 2 is not the beginning of the string or line
regex2.lastIndex = 2;
regex2.test('.\nfoo'); // true - index 2 is the beginning of a line

But it's quite new. You won't be able to use it on public web sites yet. Check the browser compatibility chart in the linked documentation.
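
As a sketch of how the sticky flag could be applied to the tokenizer from the question, assuming a runtime that supports the y flag: the `tokenize` helper, its return shape, and the rule table are illustrative names, not taken from the posts above.

```javascript
// Hypothetical helper: tokenize `input` from `start`, stopping at the first
// position where no rule matches (like the `else` branch of the Perl loop).
function tokenize(input, start = 0) {
  // One sticky regex per token type; order matters (integer before word).
  const rules = [
    ["whitespace", /\s+/y],
    ["integer", /[0-9]+/y],
    ["word", /\w+/y],
  ];
  const tokens = [];
  let pos = start;

  scan: while (pos < input.length) {
    for (const [type, re] of rules) {
      re.lastIndex = pos;        // plays the role of Perl's pos()
      const m = re.exec(input);  // with y, a match must begin exactly at lastIndex
      if (m) {
        tokens.push({ type, text: m[0], index: pos });
        pos = re.lastIndex;
        continue scan;
      }
    }
    break; // nothing matched at pos
  }
  return { tokens, pos };
}

const result = tokenize(" a 1 # ");
for (const t of result.tokens) console.log(t.type);
console.log("done at index", result.pos);
```

For the sample string this prints whitespace, word, whitespace, integer, whitespace, and then reports that it stopped at index 5, the # character. Because each regex runs against the full input rather than a slice, nothing is copied and assertions such as \b still see the text to the left of the current position.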

Upvotes: 4
