Reputation: 3763
I have already parsed a string up to index idx. My next parse step uses a Regexp. It needs to match the next part of the string, i.e. starting from position idx. How do I do this efficiently?
For example:
let myString = "<p>ONE</p><p>TWO</p>"
let idx
// some code not shown here parses the first paragraph
// and updates idx
idx = 10
// next parse step must continue from idx
let myRegex = /<p>[^<]*<\/p>/
let subbed = myString.substring(idx)
let result = myRegex.exec(subbed)
console.log(result) // "<p>TWO</p>", not "<p>ONE</p>"
But myString.substring(idx) seems like a quite expensive operation.
Are there no regex operations like this: result = myRegex.execFromIndex(idx, myString); ?
In general, I want to start regex matching from different indexes so I can exclude parts of the string and avoid matching what has already been parsed. One time it may start from myString[0], another time from myString[51], and so on.
Is there a way to do this efficiently? I'm parsing hundreds of thousands of lines and want to do this as cheaply as possible.
Upvotes: 13
Views: 4270
Reputation: 15030
Use Regexp.exec and lastIndex
1. Create a Regexp with the y or g flag
   - with the y flag, the match must start exactly at the specified start index
   - with the g flag, the match can occur anywhere at or after the specified index
2. Set the lastIndex property to the start index
3. Call exec
I've applied the above steps to your example code:
let myString = "<p>ONE</p><p>TWO</p>"
let idx
// some code not shown here parses the first paragraph
// and updates idx
idx = 10
// next parse step must continue from idx
let myRegex = /<p>[^<]*<\/p>/y // 🚩note the 'y' flag!🚩
myRegex.lastIndex = idx
let result = myRegex.exec(myString)
console.log(result) // "<p>TWO</p>", not "<p>ONE</p>"
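To make the difference between the two flags concrete, here is a small sketch using the same string. The variable names (stickyRe, globalRe and so on) are only for illustration. It starts one character too early on purpose, to show that y refuses to skip ahead while g scans forward.

```javascript
const text = "<p>ONE</p><p>TWO</p>";

// Sticky (y): the match must begin exactly at lastIndex.
const stickyRe = /<p>[^<]*<\/p>/y;
stickyRe.lastIndex = 9;                  // index 9 is the ">" that closes the first paragraph
const stickyMiss = stickyRe.exec(text);  // no match here; exec also resets lastIndex to 0

stickyRe.lastIndex = 10;                 // index 10 is where the second paragraph starts
const stickyHit = stickyRe.exec(text);

// Global (g): the search begins at lastIndex and scans forward.
const globalRe = /<p>[^<]*<\/p>/g;
globalRe.lastIndex = 9;
const globalHit = globalRe.exec(text);

console.log(stickyMiss);                     // null
console.log(stickyHit[0], stickyHit.index);  // <p>TWO</p> 10
console.log(globalHit[0], globalHit.index);  // <p>TWO</p> 10
```

For a parser that must not silently skip unparsed input, the y flag is usually the safer choice, since a failed match tells you the input at that position is not what you expected.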
Another useful thing to know is that exec will update lastIndex to point to the position in the string after the returned match. This allows you to do many things, including:
- Pass the lastIndex value to a different Regexp, if the next thing you want to parse has a different pattern.
- Store the lastIndex value into a variable used by your non-regex parsing.
- Return lastIndex to the caller of your function, so the caller can proceed with the rest of the string however it wants.

string.slice and substring are good solutions too

> But myString.substring(idx) seems like a quite expensive operation.
Not necessarily so! Although they probably won't be as fast as Rust, all the leading JavaScript engines (SpiderMonkey, V8, JavaScriptCore) do exactly what you describe for Rust. They optimize string.slice and substring behind the scenes, using pointers into the source string rather than making copies.
The article Adventures in the land of substrings and RegExps has a lot of great detail, pictures and analysis, but it is five years old and things have likely gotten even better since. See also the answer to this StackOverflow question: Is Javascript substring virtual?
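Coming back to the lastIndex approach from the first half of this answer: the idea of handing lastIndex from one Regexp to the next can be sketched roughly as follows. The three patterns and the helper names matchAt and parseParagraphs are made up for this example and are not part of any library. Every pattern carries the y flag, so each one has to match exactly where the previous one stopped.

```javascript
// Hypothetical token patterns; all sticky, so each must match exactly at lastIndex.
const openTag = /<p>/y;
const textRun = /[^<]*/y;   // may match the empty string, so it never fails
const closeTag = /<\/p>/y;

// Run `regex` against `input` starting exactly at `pos`.
// Returns [matchedText, nextPos], or null when nothing matches at `pos`.
function matchAt(regex, input, pos) {
  regex.lastIndex = pos;
  const m = regex.exec(input);
  return m === null ? null : [m[0], regex.lastIndex];
}

// Parse consecutive <p>...</p> paragraphs and return their text contents.
// Stops at the first position that does not look like a paragraph.
function parseParagraphs(input) {
  const out = [];
  let pos = 0;
  while (pos < input.length) {
    const open = matchAt(openTag, input, pos);
    if (open === null) break;
    const body = matchAt(textRun, input, open[1]);   // continue where openTag stopped
    const close = matchAt(closeTag, input, body[1]); // continue where textRun stopped
    if (close === null) break;
    out.push(body[0]);
    pos = close[1];
  }
  return out;
}

const paragraphs = parseParagraphs("<p>ONE</p><p>TWO</p>");
console.log(paragraphs); // [ 'ONE', 'TWO' ]
```

The source string is never sliced here; only the integer position moves from one regex to the next.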
Upvotes: 9
Reputation: 20504
A JavaScript Regexp has a lastIndex property that Regexp.exec() uses as a placeholder: it holds the index just past the previous match, so exec() knows where to start next. So setting myRegex.lastIndex = 3 should solve your problem, provided the regex has the g (or y) flag. Without one of those flags, exec() ignores lastIndex and always starts at index 0.
It's more efficient than the substring method because it doesn't need to create an extra string, and setting the lastIndex property is probably a quicker operation than doing a substring. Everything else is exactly the same as what you were doing.
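If you want something shaped like the execFromIndex call from the question, a thin wrapper is enough. This is only a sketch: execFromIndex is a hypothetical helper, not a built-in method, and here it takes the regex as its first argument. It also guards against the easy mistake of forgetting the flag.

```javascript
// Hypothetical helper, named after the method wished for in the question.
function execFromIndex(regex, str, idx) {
  if (!regex.global && !regex.sticky) {
    throw new Error("regex needs the g or y flag, otherwise lastIndex is ignored");
  }
  regex.lastIndex = idx;
  return regex.exec(str);
}

const helloRe = /l./g;
const hit = execFromIndex(helloRe, "Hello World", 3);
console.log(hit[0], hit.index, helloRe.lastIndex); // lo 3 5

// Without g or y, exec starts at index 0 no matter what lastIndex says.
const plainRe = /l./;
plainRe.lastIndex = 3;
const plainHit = plainRe.exec("Hello World");
console.log(plainHit[0], plainHit.index); // ll 2
```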
Below is a test that shows that setting lastIndex will produce the same result as doing the substring first.
var result1Elem = document.getElementById('result1');
var result2Elem = document.getElementById('result2');
var runBtn = document.getElementById('RunBtn');
runBtn.addEventListener("click", runTest);

function runTest() {
  var substrStart = +document.getElementById('substrStartText').value;
  var pattern = document.getElementById('regexText').value;

  // Approach 1: search the full string, starting at lastIndex
  var myRegex1 = new RegExp(pattern, 'g');
  myRegex1.lastIndex = substrStart;
  var myString1 = document.getElementById('testText').value;

  // Approach 2: search a substring that begins at the same index
  var myRegex2 = new RegExp(pattern, 'g');
  var myString2 = myString1.substring(substrStart);

  // Clear the output of any previous run
  result1Elem.innerHTML = '';
  result2Elem.innerHTML = '';

  var result;
  var safety = 0; // guards against an endless loop on zero-length matches
  while ((result = myRegex1.exec(myString1)) !== null) {
    result1Elem.innerHTML += '<li>' + result[0] + ' at ' + result.index + '</li>';
    if (safety++ > 50) break;
  }
  safety = 0;
  while ((result = myRegex2.exec(myString2)) !== null) {
    // Shift the index back so it is relative to the full string
    result2Elem.innerHTML += '<li>' + result[0] + ' at ' + (result.index + substrStart) + '</li>';
    if (safety++ > 50) break;
  }
}
<table>
<tr><td>Test </td><td> <input type="text" value="Hello World" id="testText" /></td></tr>
<tr><td>Regex </td><td> <input type="text" value="l." id="regexText" /></td></tr>
<tr><td>Substring Start </td><td> <input type="text" value="3" id="substrStartText" /></td></tr>
<tr><td colspan="2"><button id="RunBtn">Run</button></td></tr>
</table>
<table style="width:100%">
<tr style="font-weight:bold; background:#ccc">
<td>Results of Regex with lastIndex = 3</td>
<td>Results of Regex on the substring</td>
</tr>
<tr>
<td><ul id="result1"></ul></td>
<td><ul id="result2"></ul></td>
</tr>
</table>
Upvotes: 8