Why does my regular expression grab the EOLN as well?

Question

I'm trying to write a batch file to automate bulk edits of some Pascal source. I have source files with the occasional line like this one:

     //{## identifier} Inc (Index) ; { a comment }    // another comment

and I want to change them all to:

     {$ifdef identifier} Inc (Index) ; { a comment }    // another comment {$endif}

Below is a test batch file I'm using.

:: File TestRXRepl.bat
:: ===================     

@echo     //{##   identifier} Inc (Index) ; { a comment }    // another comment >t.pas
@set "FindRegExp=(\ *)\/\/\{\#\#\ *([a-z,0-9,_]+)\}(\ *)(.*)"
@set "ReplRegExp=\1{$ifdef \2}\3\4 {$endif}"

rxrepl --file t.pas --output t.out --search "%FindRegExp%" --replace "%ReplRegExp%"
@type t.pas
@type t.out

The regular expression is supposed to:

capture leading indent (group 1)
match //{##
skip any spaces
capture an identifier (group 2)
match }
capture the source code indent (group 3)
capture the source line from then on to the end of the line (group 4)

Everything works except the end-of-line handling. Group 4 is supposed to capture everything from the start of the source code up to the end of the line, but it seems to include the end-of-line itself, so the {$endif} is written to the next line. That is, I get:

{$ifdef identifier} Inc (Index) ; { a comment }    // another comment
{$endif}

rather than:

{$ifdef identifier} Inc (Index) ; { a comment }    // another comment {$endif}

The tool I am using is RXRepl. It has an --eol option that sounds like it might be useful, but I couldn't change the behaviour by using it.

(Notes)

I know both results are syntactically correct, but that's not the point ;-)
Groups 3 and 4 could be combined.
It doesn't handle other whitespace characters.
There are classier ways of matching an identifier, I know.

Suggestions to make it more elegant are welcome, as well as suggestions to make it work right.

filbranden · Accepted Answer

The problem seems to be that your . is matching the newline, which suggests that the PCRE2_DOTALL option is in effect. (I don't know why that's the case here; it's possible that rxrepl sets that option by default.)

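One possible workaround is to end group 4 of your regular expression with \S, i.e. (.*\S). The \S character type matches any character that is not whitespace, so the group can no longer end on the newline character(s). Note that it also stops the group at the last non-whitespace character, so any trailing spaces on the line are left outside the match.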
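To see how the \S workaround behaves, here is a minimal sketch in Python. Python's re module is used for illustration only; it is not the PCRE2 engine that rxrepl uses, and the re.DOTALL flag is an assumption standing in for whatever makes . match the newline there. The pattern is a simplified port of the one in the question, and the test line includes the trailing space that the echo command writes before the redirection.

```python
import re

# Simplified port of the question's pattern, with group 4 ending in \S.
# re.DOTALL is an assumption that imitates "." matching the newline.
FIND = re.compile(r"( *)//\{## *([a-z,0-9,_]+)\}( *)(.*\S)", re.DOTALL)
REPL = r"\1{$ifdef \2}\3\4 {$endif}"

# One line as the batch file would write it: trailing space, then CRLF.
line = "     //{##   identifier} Inc (Index) ; { a comment }    // another comment \r\n"

# Group 4 backtracks to the last non-whitespace character, so the
# trailing space and the CRLF stay outside the match and are kept as-is.
result = FIND.sub(REPL, line)
print(repr(result))
```

In this sketch the {$endif} lands on the same line, and the original trailing space and line ending follow it untouched. Because .* is greedy, a multi-line input under the same assumption would match up to the last non-whitespace character of the whole input, so the sketch only covers the single-line case.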

But probably the best way to fix this is to use the \N sequence, which is described in the manual as:

The \N escape sequence has the same meaning as the "." metacharacter when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the meaning of \N.

So just using (\N*) for group 4 in your match will match everything it's currently matching, except for the trailing newline.

In your script, simply update this line:

@set "FindRegExp=(\ *)\/\/\{\#\#\ *([a-z,0-9,_]+)\}(\ *)(\N*)"
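The difference can be reproduced outside rxrepl with a short Python sketch. This is only an illustration under stated assumptions: Python's re module is not PCRE2, re.DOTALL stands in for PCRE2_DOTALL, and because Python's re has no \N "non-newline" escape, the class [^\r\n] is used as an approximate equivalent. The names HEAD, greedy and no_eol are made up for the example.

```python
import re

line = "     //{##   identifier} Inc (Index) ; { a comment }    // another comment\n"
REPL = r"\1{$ifdef \2}\3\4 {$endif}"

# Groups 1 to 3 of the question's pattern, with simplified escaping.
HEAD = r"( *)//\{## *([a-z,0-9,_]+)\}( *)"

# With DOTALL in effect, "." also matches the newline, so group 4 swallows it.
greedy = re.compile(HEAD + r"(.*)", re.DOTALL)

# [^\r\n]* stands in for \N*: it stops at the line ending regardless of DOTALL.
no_eol = re.compile(HEAD + r"([^\r\n]*)", re.DOTALL)

broken = greedy.sub(REPL, line)
fixed = no_eol.sub(REPL, line)

print(repr(broken))  # {$endif} ends up after the newline
print(repr(fixed))   # {$endif} stays on the same line
```

If rxrepl's build of PCRE2 does not accept \N for some reason, the same character class, ([^\r\n]*), should be usable for group 4 in the batch file as well, since it is ordinary regular expression syntax.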
