Reputation: 2659
In Perl, continuous parsing of a string can be done something like this:
my $string = " a 1 # ";
while () {
if ( $string =~ /\G\s+/gc ) {
print "whitespace\n";
}
elsif ( $string =~ /\G[0-9]+/gc ) {
print "integer\n";
}
elsif ( $string =~ /\G\w+/gc ) {
print "word\n";
}
else {
print "done\n";
last;
}
}
Source: When is \G useful application in a regex?
It produces the following output:
whitespace
word
whitespace
integer
whitespace
done
In JavaScript (and many other regular expression flavors) there is no \G anchor, nor any good replacement.
So I came up with a very simple solution that serves my purpose.
<!-- language: lang-js -->
//*************************************************
// pattmatch - Matches the pattern PAT against ST starting at POS
// note the "^" in the pattern, used to simulate the "\G" anchor
//*************************************************
function pattmatch(st,pat,pos)
{
var resu;
pat.lastIndex=0;
if (pos===0)
return pat.exec(st); // match from the start of the string
else {
resu = pat.exec(st.slice(pos)); // match against the rest of the string
if (resu)
pat.lastIndex = pat.lastIndex + pos; // convert back to an index into st
return resu;
} // if
}
So, the above example would look like this in JavaScript (Node.js):
<!-- language: lang-js -->
var string = " a 1 # ";
var pos=0, ret;
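// The alternatives are grouped so that "^" anchors all of them, and the "m" flag is
// omitted so that "^" matches only at the very start of the sliced string.
var getLexema = new RegExp("^(?:(\\s+)|([0-9]+)|(\\w+))","gi");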
while (pos<string.length && ( ret = pattmatch(string,getLexema,pos)) ) {
if (ret[1]) console.log("whitespace");
if (ret[2]) console.log("integer");
if (ret[3]) console.log("word");
pos = getLexema.lastIndex;
} // While
console.log("done");
It produces the same output as the Perl snippet:
whitespace
word
whitespace
integer
whitespace
done
Notice that the parser stops at the # character. Parsing can be resumed from position pos in another code snippet.
❖
Is there a better way in JavaScript to simulate Perl's \G regex anchor?
Out of curiosity, I decided to compare my own solution with @georg's proposal. I am not claiming that either is better; for me, it's a matter of taste.
Will my system, which depends heavily on user interaction, become slow?
@ikegami writes about @georg's solution:
... his solution adds is a reduction in the number of times your input file is copied ...
So I decided to compare both solutions in a loop that repeats the code 10 million times:
<!-- language: lang-js -->
var i;
var n1,n2;
var string,pos,m,conta,re;
// My code
conta=0;
n1 = Date.now();
for (i=0;i<10000000;i++) {
string = " a 1 # ";
pos=0;
re = new RegExp("^(\\s+)|([0-9]+)|(\\w+)","gim");
while (pos<string.length && ( m = pattMatch(string,re,pos)) ) {
if (m[1]) conta++;
if (m[2]) conta++;
if (m[3]) conta++;
pos = re.lastIndex;
} // While
}
n2 = Date.now();
console.log('Mine: ' , ((n2-n1)/1000).toFixed(2), ' segundos' );
// Other code
conta=0;
n1 = Date.now();
for (i=0;i<10000000;i++) {
string = " a 1 # ";
re = /^(?:(\s+)|([0-9]+)|(\w+))/i;
while (m = string.match(re)) {
if (m[1]) conta++;
if (m[2]) conta++;
if (m[3]) conta++;
string = string.slice(m[0].length)
}
}
n2 = Date.now();
console.log('Other: ' , ((n2-n1)/1000).toFixed(2) , ' segundos');
//*************************************************
// pattMatch - Matches the pattern PAT against ST starting at POS
// note the "^" in the pattern, used to simulate the "\G" anchor
//*************************************************
function pattMatch(st,pat,pos)
{
var resu;
pat.lastIndex=0;
if (pos===0)
return pat.exec(st);
else {
resu = pat.exec(st.slice(pos));
if (resu)
pat.lastIndex = pat.lastIndex + pos;
return resu;
}
} // pattMatch
Results:
Mine: 11.90 segundos
Other: 10.77 segundos
My code runs about 10% longer. It spends about 110 nanoseconds more per iteration.
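Both figures follow directly from the two timings. A quick check, as a sketch that uses only the numbers reported in the results (the variable names are illustrative):

```javascript
// Timings reported above, in seconds, for 10 million outer iterations each.
const mineSeconds = 11.90;
const otherSeconds = 10.77;
const iterations = 10000000;

// Relative slowdown of the slower run, as a percentage of the faster one.
const slowdownPercent = ((mineSeconds - otherSeconds) / otherSeconds) * 100;

// Extra time per outer iteration, converted from seconds to nanoseconds.
const extraNanoseconds = ((mineSeconds - otherSeconds) / iterations) * 1e9;

console.log(slowdownPercent.toFixed(1) + "% slower");                  // 10.5% slower
console.log(Math.round(extraNanoseconds) + " ns more per iteration");  // 113 ns more per iteration
```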
Honestly, I find this loss of efficiency acceptable in a system with heavy user interaction.
If my project involved heavy mathematical processing with multidimensional arrays or gigantic neural networks, I might reconsider.
Upvotes: 7
Views: 308
Reputation: 214959
Looks like you're overcomplicating it a bit. exec with the g flag provides anchoring out of the box:
var
string = " a 1 # ",
re = /(\s+)|([0-9]+)|(\w+)|([\s\S])/gi,
m;
while (m = re.exec(string)) {
if (m[1]) console.log('space');
if (m[2]) console.log('int');
if (m[3]) console.log('word');
if (m[4]) console.log('unknown');
}
If your regexp is not covering, and you want to stop on the first non-match, the simplest way would be to match from the ^ and strip the string once matched:
var
string = " a 1 # ",
re = /^(?:(\s+)|([0-9]+)|(\w+))/i,
m;
while (m = string.match(re)) {
if (m[1]) console.log('space');
if (m[2]) console.log('int');
if (m[3]) console.log('word');
string = string.slice(m[0].length)
}
console.log('done, rest=[%s]', string)
This simple method doesn't fully replace \G (or your "match from" method), because it loses the left context of the match.
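To illustrate what losing the left context means, here is a minimal sketch that compares matching against a sliced string with matching in place. It assumes an engine that supports the sticky flag /y for the in-place case; the input "ab" and the variable names are made up for the illustration.

```javascript
const input = "ab";

// Slice approach: after consuming "a", the remainder "b" is a fresh string,
// so \b sees a word boundary that does not exist in the original input.
const rest = input.slice(1);
const sliceMatches = /^\bb/.test(rest);

// In-place approach: match at index 1 of the full string, where \b can still
// see the "a" to the left and therefore finds no word boundary.
const inPlace = /\bb/y;
inPlace.lastIndex = 1;
const inPlaceMatches = inPlace.test(input);

console.log(sliceMatches, inPlaceMatches); // true false
```

The same difference shows up with any construct that looks to the left of the match position, such as lookbehind assertions.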
Upvotes: 2
Reputation: 385847
The functionality of \G exists in the form of the /y flag.
var regex = /^foo/y;
regex.lastIndex = 2;
regex.test('..foo'); // false - index 2 is not the beginning of the string
var regex2 = /^foo/my;
regex2.lastIndex = 2;
regex2.test('..foo'); // false - index 2 is not the beginning of the string or line
regex2.lastIndex = 2;
regex2.test('.\nfoo'); // true - index 2 is the beginning of a line
But it's quite new. You won't be able to use it on public web sites yet. Check the browser compatibility chart in the linked documentation.
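As a sketch of how the sticky flag could replace the slicing workaround from the question, the tokenizer loop can set lastIndex before each exec call. The function name tokenize and the shape of the returned object are illustrative, not taken from the posts above, and the code assumes an engine with /y support (ES2015 or later).

```javascript
// Tokenize input from left to right, stopping at the first position
// where none of the alternatives match.
function tokenize(input) {
  // No "^" is needed: with /y the match must start exactly at lastIndex.
  const re = /(\s+)|([0-9]+)|(\w+)/y;
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    re.lastIndex = pos;          // a failed sticky match resets lastIndex to 0,
    const m = re.exec(input);    // so the position is tracked separately in pos
    if (m === null) break;
    if (m[1] !== undefined) tokens.push("whitespace");
    else if (m[2] !== undefined) tokens.push("integer");
    else tokens.push("word");
    pos = re.lastIndex;
  }
  return { tokens: tokens, pos: pos };
}

const result = tokenize(" a 1 # ");
console.log(result.tokens.join("\n"));
console.log("done, stopped at index " + result.pos); // done, stopped at index 5
```

Because the input is never sliced, the regex keeps its full left context, and parsing can resume later from the returned pos.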
Upvotes: 4