What is the difference between `(\S.*\S)` and `^\s*(.*)\s*$` in regex?

Question

I'm doing the RegexOne regex tutorial and it has a question about writing a regular expression to remove unnecessary whitespace.

The solution provided in the tutorial is

We can just skip all the starting and ending whitespace by not capturing it in a line. For example, the expression ^\s*(.*)\s*$ will catch only the content.

The setup for the question does indicate the use of the hat at the beginning and the dollar sign at the end, so it makes sense that this is the expression that they want:

We have previously seen how to match a full line of text using the hat ^ and the dollar sign $ respectively. When used in conjunction with the whitespace \s, you can easily skip all preceding and trailing spaces.

That said, using \S instead, I was able to come up with what seems like a simpler solution - (\S.*\S).

I've found this Stack Overflow solution that matches the one in the tutorial (Regex Email - Ignore leading and trailing spaces?), and I've seen other guides that recommend the same format, but I'm struggling to find an explanation for why the \S is bad.

Additionally, this validates as correct in their tool... so, are there cases where this would not work as well as the provided solution? Or is the recommended version just a standard format?

CertainPerformance · Accepted Answer

The tutorial's solution of ^\s*(.*)\s*$ is wrong. The .* inside the capture group is greedy, so it expands as far as it can, all the way to the end of the line, and captures the trailing spaces too. Because the \s* that follows is allowed to match zero characters, the pattern succeeds right there: the .* never has to backtrack, and the trailing \s* never consumes anything.

https://regex101.com/r/584uVG/1
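
The same behavior can be reproduced outside regex101. The following is a minimal sketch; the sample line and the variable names are made up for illustration:

```javascript
// The tutorial's pattern, applied to a line padded with spaces on both sides.
const tutorialPattern = /^\s*(.*)\s*$/;
const paddedLine = '   foo bar   ';

// Group 1 is what the tutorial says should contain "only the content".
const capturedContent = paddedLine.match(tutorialPattern)[1];

// JSON.stringify makes the trailing spaces visible in the output.
console.log(JSON.stringify(capturedContent)); // "foo bar   "
```

The leading spaces are skipped by the first \s*, but the trailing ones end up inside the group.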

Your solution is much better at actually matching only the non-whitespace content in the line, but there are a couple of odd cases in which it won't match the content at all. (\S.*\S) needs at least two non-space characters, so a line whose only content is a single character produces no match. The tutorial's (.*), by contrast, can capture a single character, and it captures an empty string if the input is composed entirely of whitespace.
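
These edge cases can be checked with a short sketch. The helper name `capture` and the sample inputs are hypothetical, chosen only to illustrate the point:

```javascript
const askerPattern = /(\S.*\S)/;
const tutorialPattern = /^\s*(.*)\s*$/;

// Return capture group 1, or null when the pattern does not match at all.
function capture(pattern, text) {
  const match = text.match(pattern);
  return match ? match[1] : null;
}

// A line with a single non-space character.
console.log(capture(askerPattern, '  a  '));    // null (needs two non-space characters)
console.log(capture(tutorialPattern, '  a  ')); // 'a  ' (matches, trailing spaces included)

// A line made only of whitespace.
console.log(capture(askerPattern, '     '));    // null
console.log(capture(tutorialPattern, '     ')); // '' (matches, empty capture)
```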

But, given the problem description at your link:

Occasionally, you'll find yourself with a log file that has ill-formatted whitespace where lines are indented too much or not enough. One way to fix this is to use an editor's search and replace and a regular expression to extract the content of the lines without the extra whitespace.

From this, matching only the non-whitespace content (like you're doing) probably wouldn't, on its own, remove the undesirable leading and trailing spaces. The tutorial is probably trying to guide you towards a technique that matches a whole line with a particular pattern and then replaces that line with only the captured group, like:

Match ^\s*(.*\S)\s*$, replace with $1: https://regex101.com/r/584uVG/2/
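
In JavaScript, the same match-and-replace could be sketched with `String.prototype.replace`. The sample text is invented, and it deliberately contains no blank lines, since \s also matches line breaks:

```javascript
// Hypothetical log text with inconsistent indentation and trailing spaces.
const logText = '   foo   \nbar\n  baz qux   ';

// 'g' replaces every match; 'm' lets ^ and $ match at each line boundary.
// Each matched line is replaced by its capture group, which must end in \S.
const trimmedText = logText.replace(/^\s*(.*\S)\s*$/gm, '$1');

console.log(JSON.stringify(trimmedText)); // "foo\nbar\nbaz qux"
```

Ending the group with \S is what stops the greedy .* from swallowing the trailing spaces.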

Your technique would work for this problem if you had a way to make a new text file containing only the captured groups (or all the full matches), e.g.:

const input = `   foo   
bar
  baz   
qux  `;
// Keep every full match (one per non-blank line) and rejoin them with newlines.
const newText = (input.match(/\S(?:$|.*\S)/gm) || [])
  .join('\n');
console.log(newText);

Using \S instead of . is not bad. If you know that a particular position must be matched by a non-space character rather than by a space, \S is more precise: it makes the intent of the pattern clearer, lets a bad match fail faster, and can avoid catastrophic backtracking in some cases. These patterns don't have backtracking issues, but it's still a good habit to get into.
