Emacs, using replace-regexp-in-string to match two regexps

Question

I'm trying to replace two parts of a string using replace-regexp-in-string but I can only get one part to work at a time. Here is an example where I want to remove the # and spaces from the beginning and the newline from the end of the string. What am I doing wrong when I combine the two calls into one expression?

;; Test string
(setq inputStr "## Header Stuff\n")

;; This doesn't trim the newline
(setq header
      (replace-regexp-in-string "^[#\s]*\\|\n$" "" inputStr))

;; Each match done separately works though
(setq header
      (replace-regexp-in-string "^[#\s]*" "" inputStr))
(setq header
      (replace-regexp-in-string "\n$" "" header))

header
"Header Stuff"

UPDATE: the problem seems to be with the first branch of the expression. For example, this replaces both the newline and the "S" with "X": (replace-regexp-in-string "S\\|\n$" "X" inputStr).

user725091 · Accepted Answer

It looks like replace-regexp-in-string has some unexpected behavior with regexps which match the empty string. The following regexp does what you would expect (note the + quantifier in place of *):

(let ((input-string "## Header Stuff\n"))
  (replace-regexp-in-string "\\`[#\s]+\\|\n*\\'" "" input-string))

The reason lies in the internal implementation of replace-regexp-in-string, which you can look up using M-x find-function. In pseudocode, it does approximately the following:

Given a regexp, a replacement, and a string:

Set l to the length of the string and start to 0. Create an empty stack called matches to accumulate pieces of the new string.
As long as start is less than l and regexp matches somewhere within string, do the following:
1. Extract the portion of string that matched the regexp, and call it str.
2. Match regexp again, this time against the shorter string str alone, and replace that match with replacement (this is important).
3. Push the following two fragments of the new string onto the matches stack:
  - the unmatched initial portion of string, from start to the beginning of the match
  - the substring str, in which the match for regexp has now been replaced by replacement
4. Set start to the end of the matched portion and repeat.
Finally, join up the string fragments on the matches stack in reverse order and return the result.
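
The loop above can be sketched in Emacs Lisp roughly as follows. This is a simplified model for illustration, not the actual source of replace-regexp-in-string: the name my/naive-replace-regexp-in-string is hypothetical, the optional arguments of the real function are left out, and the handling of empty matches and of the unmatched tail is my own reading of what the summary leaves implicit.

```elisp
;; Hypothetical helper, not part of Emacs: a simplified model of the
;; loop described above.
(defun my/naive-replace-regexp-in-string (regexp rep string)
  "Replace matches of REGEXP in STRING with REP, one matched piece at a time."
  (let ((l (length string))
        (start 0)
        (matches nil))
    (save-match-data
      (while (and (< start l) (string-match regexp string start))
        (let* ((mb (match-beginning 0))
               ;; Assumption: advance by one character on an empty match,
               ;; so the loop always makes progress.
               (me (if (= mb (match-end 0)) (min l (1+ mb)) (match-end 0)))
               ;; Step 1: extract the matched portion.
               (str (substring string mb me)))
          ;; Step 2: match REGEXP again, against STR alone.
          (string-match regexp str)
          ;; Step 3: push the unmatched prefix, then the rewritten STR.
          (push (substring string start mb) matches)
          (push (replace-match rep t t str) matches)
          ;; Step 4: continue after the matched portion.
          (setq start me))))
    ;; Add whatever follows the last match, then join in reverse order.
    (apply #'concat (nreverse (cons (substring string start) matches)))))

;; The question's regexp keeps the newline in this model:
(my/naive-replace-regexp-in-string "^[#\s]*\\|\n$" "" "## Header Stuff\n")
;; => "Header Stuff\n"

;; The regexp with `+' removes it:
(my/naive-replace-regexp-in-string "\\`[#\s]+\\|\n*\\'" "" "## Header Stuff\n")
;; => "Header Stuff"
```

Because the sketch does its own re-matching in step 2, it reproduces the behavior described here regardless of how the built-in function is implemented in the Emacs you are running.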

The problem with your original regexp happens at step (2) of the loop. Even though the regexp correctly matches the newline at the end of the complete string "## Header Stuff\n", when it is matched a second time against the one-character string "\n", the first branch of the alternation, which can match the empty string, takes priority over the second. It therefore replaces the empty string with the empty string and fails to remove the trailing newline.
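
The second match can be checked directly with string-match, independently of replace-regexp-in-string. The small helper below, my/match-span, is a hypothetical name introduced only for this illustration; it returns the start and end of the first match.

```elisp
;; Hypothetical helper, for illustration only.
(defun my/match-span (regexp string)
  "Return (BEG END) of the first match of REGEXP in STRING, or nil."
  (save-match-data
    (when (string-match regexp string)
      (list (match-beginning 0) (match-end 0)))))

;; The question's regexp against the one-character string "\n":
;; the first branch wins with an empty match at position 0.
(my/match-span "^[#\s]*\\|\n$" "\n")         ; => (0 0)

;; With `+' the first branch cannot match the empty string, so the
;; second branch consumes the newline.
(my/match-span "\\`[#\s]+\\|\n*\\'" "\n")    ; => (0 1)
```

An empty match from 0 to 0 is what gets "replaced" in the first case, which is why the newline survives.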

This is arguably a bug in replace-regexp-in-string, but it also shows how tricky regexp semantics can be, especially when empty strings are involved. To me, the workaround solution is easier to read and understand:

(let ((input-string "## Header Stuff\n"))
  (setq input-string (replace-regexp-in-string "\\`[#\s]*" "" input-string))
  (setq input-string (replace-regexp-in-string "\n*\\'" "" input-string))
  input-string)

If you have a very recent Emacs (pretest 24.4 or higher), you can also use the string-trim-right function from the built-in subr-x package:

(require 'subr-x)  ; string-trim-right is defined in subr-x
(let ((input-string "## Header Stuff\n"))
  (string-trim-right (replace-regexp-in-string "\\`[#\s]*" "" input-string)))

By the way, I was surprised to find out while investigating this that \s in Emacs strings is just a different way of writing the space character. If you want regexp behavior similar to Perl's \s wildcard, you might want to use "\\s-" (match any character with whitespace syntax), or "[[:space:]]".
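
A few quick checks of that difference, as a sketch. The syntax-based constructs depend on the current syntax table, so the last form pins it to the standard one; that choice is an assumption made only to keep the example predictable.

```elisp
;; In a Lisp string, "\s" is read as a plain space character.
(string= "\s" " ")                        ; => t

;; So "[#\s]" matches "#" or a space, but not a tab.
(string-match "[#\s]" "\t")               ; => nil

;; "[[:space:]]" and "\\s-" consult the syntax table; in the standard
;; syntax table a tab has whitespace syntax.
(with-syntax-table (standard-syntax-table)
  (list (string-match "[[:space:]]" "\t")
        (string-match "\\s-" "\t")))      ; => (0 0)
```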
