mottosson

Reputation: 3763

Efficient Regexp matching starting from given index within string

I have already parsed a string up to index idx. My next parse step uses a RegExp. It needs to match the next part of the string, i.e. starting from position idx. How do I do this efficiently?

For example:

let myString = "<p>ONE</p><p>TWO</p>"
let idx

// some code not shown here parses the first paragraph
// and updates idx
idx = 10

// next parse step must continue from idx 
let myRegex = /<p>[^<]*<\/p>/
let subbed = myString.substring(idx)
let result = myRegex.exec(subbed)
console.log(result) // "<p>TWO</p>", not "<p>ONE</p>"

But myString.substring(idx) seems like a quite expensive operation.

Is there no regex operation like this: result = myRegex.execFromIndex(idx, myString)?

In general, I want to start regex matching from different indexes so I can exclude parts of the string and avoid matching text that has already been parsed. One time it may start from myString[0], another time from myString[51], and so on.

Is there a way to do this efficiently? I'm parsing hundreds of thousands of lines and want to do this as cheaply as possible.

Upvotes: 13

Views: 4270

Answers (2)

Inigo

Reputation: 15030

Use RegExp.exec and lastIndex

  1. Create a RegExp with the y or g flag
    • with the y flag, the match must start exactly at the specified start index
    • with the g flag, the match can start at or anywhere after the specified index
  2. Set its lastIndex property to the start index
  3. Call exec

I've applied the above steps to your example code:

let myString = "<p>ONE</p><p>TWO</p>"
let idx

// some code not shown here parses the first paragraph
// and updates idx
idx = 10

// next parse step must continue from idx 
let myRegex = /<p>[^<]*<\/p>/y  // 🚩note the 'y' flag!🚩
myRegex.lastIndex = idx
let result = myRegex.exec(myString)
console.log(result) // "<p>TWO</p>", not "<p>ONE</p>"
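
To make the difference between the two flags concrete, here is a minimal sketch. The input string with "junk" between the paragraphs is a made-up example, not from the question, and the names stickyRe and globalRe are just for illustration:

```javascript
// Hypothetical input: some junk sits between the first paragraph and the next.
const text = "<p>ONE</p>junk<p>TWO</p>";
const start = 10; // index right after "<p>ONE</p>"

// 'y' (sticky): the match must begin exactly at lastIndex.
const stickyRe = /<p>[^<]*<\/p>/y;
stickyRe.lastIndex = start;
const stickyResult = stickyRe.exec(text);

// 'g' (global): the search begins at lastIndex and scans forward.
const globalRe = /<p>[^<]*<\/p>/g;
globalRe.lastIndex = start;
const globalResult = globalRe.exec(text);

console.log(stickyResult);        // null ("junk" is at index 10, not "<p>")
console.log(globalResult[0]);     // "<p>TWO</p>"
console.log(globalResult.index);  // 14
console.log(stickyRe.lastIndex);  // 0  (reset after the failed match)
console.log(globalRe.lastIndex);  // 24 (just past the match)
```

For a parser, y is usually the one you want: it fails fast instead of silently skipping over text you have not parsed yet.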

Another useful thing to know is that exec will update lastIndex to point to the position in the string after the returned match. This allows you to do many things, including:

  1. Rerun the same Regexp, which will automatically find the next match after that last match.
  2. Transfer the lastIndex value to a different Regexp, if the next thing you want to parse has a different pattern.
  3. Copy the lastIndex value into a variable used by your non-regex parsing.
  4. Return lastIndex to the caller of your function, so the caller can proceed with the rest of the string however it wants.
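
As a rough sketch of items 2 and 3, here is a toy parser that hands the position back and forth between two sticky patterns and a plain variable. The "key=value;" format and the names keyRe, valueRe and parsePairs are invented for this illustration:

```javascript
// Two hypothetical sticky patterns for a toy "key=value;" format.
const keyRe = /[a-z]+=/y;
const valueRe = /[0-9]+;/y;

function parsePairs(input) {
  const pairs = [];
  let pos = 0; // plain parsing state, outside any regex
  while (pos < input.length) {
    keyRe.lastIndex = pos;
    const key = keyRe.exec(input);
    if (key === null) throw new SyntaxError("expected key at index " + pos);

    // Transfer the position to a different pattern.
    valueRe.lastIndex = keyRe.lastIndex;
    const value = valueRe.exec(input);
    if (value === null) {
      throw new SyntaxError("expected value at index " + keyRe.lastIndex);
    }

    // Strip the trailing "=" and ";" delimiters from the matched text.
    pairs.push([key[0].slice(0, -1), Number(value[0].slice(0, -1))]);

    // Copy the position back into the non-regex state.
    pos = valueRe.lastIndex;
  }
  return pairs;
}

console.log(parsePairs("a=1;bc=22;")); // [["a", 1], ["bc", 22]]
```

Note that a failed exec resets that regex's lastIndex to 0, which is why the sketch keeps the authoritative position in pos rather than relying on the regex objects alone.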

Why string.slice and substring are good solutions too

But myString.substring(idx) seems like a quite expensive operation.

Not necessarily so! Although they probably won't be as fast as Rust, all the leading JavaScript engines (SpiderMonkey, V8, JavaScriptCore) do exactly what you describe for Rust. They optimize string.slice and substring behind the scenes, using pointers into the source string rather than making copies.

Adventures in the land of substrings and RegExps has a lot of great detail, pictures and analysis, but it is five years old and things have likely gotten even better since. See also the answer to this Stack Overflow question: Is Javascript substring virtual?
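
If you would rather measure than take this on faith, a rough timing sketch along these lines could be run in your own engine. The workload (20,000 copies of a made-up paragraph) and all function names are assumptions for illustration, and the timings will vary by engine and version, so no numbers are claimed here:

```javascript
// Hypothetical workload: many "<p>...</p>" paragraphs in one long string.
const big = "<p>TEXT</p>".repeat(20000);

function countWithLastIndex(s) {
  const re = /<p>[^<]*<\/p>/y;
  let pos = 0, count = 0;
  while (true) {
    re.lastIndex = pos;
    if (re.exec(s) === null) break;
    pos = re.lastIndex; // continue right after the match
    count++;
  }
  return count;
}

function countWithSubstring(s) {
  const re = /^<p>[^<]*<\/p>/; // ^ anchors the match at the start of the substring
  let pos = 0, count = 0;
  while (true) {
    const m = re.exec(s.substring(pos));
    if (m === null) break;
    pos += m[0].length;
    count++;
  }
  return count;
}

function time(label, fn) {
  const t0 = Date.now();
  const n = fn(big);
  console.log(label, n, "matches in", Date.now() - t0, "ms");
  return n;
}

const viaLastIndex = time("lastIndex:", countWithLastIndex);
const viaSubstring = time("substring:", countWithSubstring);
```

Both loops should report the same number of matches; only the elapsed time is of interest.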

Upvotes: 9

Daniel Gimenez

Reputation: 20504

A JavaScript RegExp has a lastIndex property that RegExp.exec() uses as a bookmark: it holds the index just past the previous match, so exec knows where to start next. It is only honoured when the regex has the g (or y) flag. So setting myRegex.lastIndex to your start index (3 in the test below) before calling exec should solve your problem.

It's more efficient than the substring method because it doesn't need to create an extra variable, and setting the lastIndex property is probably a quicker operation than doing a substring. Everything else is exactly the same as what you were doing.

Below is a test that shows that setting lastIndex produces the same result as doing the substring first.
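
One pitfall worth illustrating: lastIndex has no effect unless the regex carries the g or y flag. A minimal sketch, with variable names chosen just for this example:

```javascript
const sample = "Hello World";

// No g or y flag: lastIndex is ignored and the search starts at index 0.
const noFlag = /l./;
noFlag.lastIndex = 3;
const a = noFlag.exec(sample);

// With the g flag: the search starts at lastIndex.
const withG = /l./g;
withG.lastIndex = 3;
const b = withG.exec(sample);

console.log(a[0], a.index); // "ll" 2
console.log(b[0], b.index); // "lo" 3
```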

var result1Elem = document.getElementById('result1');
var result2Elem = document.getElementById('result2');
var runBtn = document.getElementById('RunBtn');
runBtn.addEventListener("click", runTest);
function runTest() {
  var substrStart = +document.getElementById('substrStartText').value
  var myRegex1 = new RegExp(document.getElementById('regexText').value, 'g');
  myRegex1.lastIndex = substrStart;
  var myRegex2 = new RegExp(document.getElementById('regexText').value, 'g');

  var myString1 = document.getElementById('testText').value;
  var myString2 = myString1.substring(substrStart);
  
  var result;
  
  var safety = 0;
  while ((result = myRegex1.exec(myString1)) !== null) {
    result1Elem.innerHTML += '<li>' + result[0] + ' at ' + result.index + '</li>';
    if (safety++ > 50) break;
  }
  
  safety = 0;
  while ((result = myRegex2.exec(myString2)) !== null) {
    result2Elem.innerHTML += '<li>' + result[0] + ' at ' + (result.index + substrStart)  + '</li>';
    if (safety++ > 50) break;
  }
}
<table>
<tr><td>Test </td><td> <input type="text" value="Hello World" id="testText" /></td></tr>
<tr><td>Regex </td><td> <input type="text" value="l." id="regexText" /></td></tr>
<tr><td>Substring Start </td><td> <input type="text" value="3" id="substrStartText" /></td></tr>
<tr><td colspan="2"><button id="RunBtn">Run</button></td></tr>
</table>

<table style="width:100%">
  <tr style="font-weight:bold; background:#ccc">
    <td>Results of Regex with lastIndex = 3</td>
    <td>Results of string substringged</td>
  </tr>
  <tr>
    <td><ul id="result1"></ul></td>
    <td><ul id="result2"></ul></td>
  </tr>
</table>
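
For anyone who wants to try the same comparison without a page, here is a DOM-free sketch that can be pasted into Node or a browser console. The function names are made up for this example, and the guard for empty matches is an addition that the snippet above does not have:

```javascript
// Collect all matches, starting the search at `start` via lastIndex.
function matchesFromLastIndex(pattern, text, start) {
  const re = new RegExp(pattern, "g");
  re.lastIndex = start;
  const found = [];
  let m;
  while ((m = re.exec(text)) !== null) {
    found.push(m[0] + " at " + m.index);
    if (m[0] === "") re.lastIndex++; // avoid looping forever on empty matches
  }
  return found;
}

// Collect all matches in the substring, shifting indexes back by `start`.
function matchesFromSubstring(pattern, text, start) {
  const re = new RegExp(pattern, "g");
  const rest = text.substring(start);
  const found = [];
  let m;
  while ((m = re.exec(rest)) !== null) {
    found.push(m[0] + " at " + (m.index + start));
    if (m[0] === "") re.lastIndex++;
  }
  return found;
}

console.log(matchesFromLastIndex("l.", "Hello World", 3)); // ["lo at 3", "ld at 9"]
console.log(matchesFromSubstring("l.", "Hello World", 3)); // ["lo at 3", "ld at 9"]
```

The two approaches agree for simple patterns like this one. They can differ for patterns that look at surrounding text (^, \b, lookbehind), because the substring hides everything before the start index.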

Upvotes: 8
