Javascript Regex: Unable to remove leading spaces in lookahead group in a multi line string

Question

I am trying the regex /^(?<=[\s]*namespace[\s]*---+\s+)(.|\s)+(?=$\s*\(\d+\s*rows\)$)/gm to extract the row items from a string in a single-column tabular list format. However, the leading spaces are included in the match. The \s+ and \s* patterns in the lookbehind and lookahead groups do not help. See below:

const x = `namespace
-------------------
               itm1
     itm2
  itm3
               itm4
               
(4 rows)
`
console.log(x.match(/^(?<=[\s]*namespace[\s]*---+\s+)(.|\s)+(?=$\s*\(\d+\s*rows\)$)/gm)[0].split(/\s+/))

The output has empty strings as the first and last list elements, caused by the leading and trailing spaces:

[ '', 'itm1', 'itm2', 'itm3', 'itm4', '' ]

But with a trim() before the split(...), as in console.log(x.match(/^(?<=[\s]*namespace[\s]*---+\s+)(.|\s)+(?=$\s*\(\d+\s*rows\)$)/gm)[0].trim().split(/\s+/)), the output is:

[ 'itm1', 'itm2', 'itm3', 'itm4' ]

Why does the \s+ at the end of the lookbehind group (?<=[\s]*namespace[\s]*---+\s+) not remove all the spaces before the part matched by (.|\s)+?

Wiktor Stribiżew · Accepted Answer

Root cause

The regex engine parses the string from left to right.

The engine first looks for a match at the start of the string. The lookbehind pattern does not match there, so the attempt fails right away and the next position is tested, between n and a in namespace. And so on, until the position right after the newline that follows -------------------.

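At the location right after that \n, the newline char, the lookbehind pattern matches: the \s+ at the end of your lookbehind needs only one whitespace char, and the newline supplies it. Then, the rest of the pattern finds a match, too. Since a lookbehind consumes nothing and (.|\s)+ matches whitespace as well, the indentation of the first item lands in the match. Hence, there are 15 leading spaces in your result.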
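To check this, here is a small sketch that prints where the match starts. It uses the question's pattern without the g flag so that match() reports an index, and the parentheses around the row count are escaped as \( and \) so that the lookahead can match the literal (4 rows) line. The names text, re, afterDashes and leading are only for illustration.

```javascript
// Same input as in the question, written with explicit \n escapes.
const text =
  "namespace\n-------------------\n               itm1\n     itm2\n  itm3\n               itm4\n               \n(4 rows)\n";

// The question's pattern, without the g flag so that match() returns the index.
const re = /^(?<=[\s]*namespace[\s]*---+\s+)(.|\s)+(?=$\s*\(\d+\s*rows\)$)/m;
const m = text.match(re);

// Index of the first char after the dashed line and its newline.
const afterDashes = text.indexOf("\n", text.indexOf("---")) + 1;

// Amount of whitespace at the very start of the matched text.
const leading = m[0].match(/^\s*/)[0].length;

console.log(m.index === afterDashes); // true: the match starts at the line start
console.log(leading > 0);             // true: the indentation is part of the match
```

The match starts at the first position where the lookbehind can succeed, not at the first non-whitespace char, which is why the indentation of itm1 ends up in the result.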

Solution

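Make sure the consuming part of the pattern starts with a non-whitespace char (\S). Alternatively, drop the lookbehind, match the prefix with a regular consuming pattern, and take the items from a capturing group.

Thus, with the lookbehind kept: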

const x = "namespace\n-------------------\n               itm1\n     itm2\n  itm3\n               itm4\n               \n(4 rows)\n";
console.log(
  x.match(/(?<=^\s*namespace\s*---+\s+)\S.*?(?=\s*$\s*\(\d+\s*rows\)$)/gms)[0].split(/\s+/)
);

Or, with a capturing group:

const x = "namespace\n-------------------\n               itm1\n     itm2\n  itm3\n               itm4\n               \n(4 rows)\n";
console.log(
  x.match(/^\s*namespace\s*---+\s+(\S.*?)(?=\s*$\s*\(\d+\s*rows\)$)/ms)[1].split(/\s+/)
);

Note on the regexps:

I replaced (.|\s)+ with a mere . pattern and added the s flag so that . can match line break chars. Please never use (.|\s)*, (.|\n)*, or (.|[\r\n])*; these are very inefficient regex patterns.
I added \s* at the start of the positive lookahead so that trailing whitespace is stripped from the match.
I also use a lazy dot, .*?, in both patterns to match the least amount of chars between the two delimiters.
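The three notes above can be tried out in isolation. This is only a sketch: the strings sample and padded are made-up inputs, not taken from the question, and padded deliberately ends its last item line with two spaces.

```javascript
// 1. Without the s flag, . does not match a line break; with it, it does.
const dotPlain = /a.b/.test("a\nb");
const dotAll = /a.b/s.test("a\nb");
console.log(dotPlain, dotAll); // false true

// 2. A greedy dot runs to the last possible end, a lazy dot stops at the first.
const sample = "<1> <2>"; // hypothetical sample string
const greedy = sample.match(/<.*>/)[0];
const lazy = sample.match(/<.*?>/)[0];
console.log(greedy); // <1> <2>
console.log(lazy);   // <1>

// 3. \s* at the start of the lookahead keeps trailing whitespace out of the match.
//    Hypothetical input: the last item line ends with two spaces.
const padded = "namespace\n---\n  itm1\n  itm2  \n(2 rows)\n";
const withTrim = padded.match(
  /(?<=^\s*namespace\s*---+\s+)\S.*?(?=\s*$\s*\(\d+\s*rows\)$)/ms
)[0];
const withoutTrim = padded.match(
  /(?<=^\s*namespace\s*---+\s+)\S.*?(?=$\s*\(\d+\s*rows\)$)/ms
)[0];
console.log(JSON.stringify(withTrim));    // "itm1\n  itm2"
console.log(JSON.stringify(withoutTrim)); // "itm1\n  itm2  "
```

With the sample from the question, the last item line has no trailing spaces, so the extra \s* only makes a difference for input like padded.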
