Reputation: 8555
Usually, when I'm replacing newlines, I reach for a regexp, as in this PHP:
preg_replace('/\R/u', "\n", $String);
I do this because I know it to be a very robust way to replace any kind of Unicode newline (be it \n, \r, \r\n, etc.).
I was trying to do something like this in Go as well, but I get
error parsing regexp: invalid escape sequence: `\R`
on this line:
msg = regexp.MustCompilePOSIX("\\R").ReplaceAllString(html.EscapeString(msg), "<br>\n")
I tried using (?:(?>\r\n)|\v)
from https://stackoverflow.com/a/4389171/728236, but it looks like Go's regex implementation doesn't support that either, panicking with invalid or unsupported Perl syntax: '(?>'
What's a good, safe way to replace newlines in Go, Regex or not?
I see the answer to "Golang: Issues replacing newlines in a string from a text file" saying to use \r?\n, but I'm hesitant to believe that it would match all Unicode newlines, mainly because another question has an answer listing many more newline code points than the few that \r?\n covers.
Upvotes: 2
Views: 10113
Reputation: 417682
While using regexp usually yields an elegant and compact solution, often it's not the fastest.
For tasks where you have to replace certain substrings with others, the standard library provides a really efficient solution in the form of strings.Replacer:
Replacer replaces a list of strings with replacements. It is safe for concurrent use by multiple goroutines.
You may create a reusable replacer with strings.NewReplacer(), where you list the pairs of strings to be replaced and their replacements. When you want to perform a replacement, you simply call Replacer.Replace().
Here's what it would look like:
const replacement = "<br>\n"

var replacer = strings.NewReplacer(
	"\r\n", replacement,
	"\r", replacement,
	"\n", replacement,
	"\v", replacement,
	"\f", replacement,
	"\u0085", replacement,
	"\u2028", replacement,
	"\u2029", replacement,
)

func replaceReplacer(s string) string {
	return replacer.Replace(s)
}
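The order of the pairs matters: strings.NewReplacer compares the old strings in argument order, so listing "\r\n" before "\r" and "\n" makes a Windows line ending count as one break rather than two. Below is a minimal, self-contained sketch of that behavior with a reduced set of pairs; the names `lineBreaks` and `toBR` are made up for this illustration and are not part of the answer above.

```go
package main

import (
	"fmt"
	"strings"
)

// lineBreaks is a hypothetical, reduced replacer used only for this sketch.
// "\r\n" is listed first so that it takes priority over "\r" and "\n".
var lineBreaks = strings.NewReplacer(
	"\r\n", "<br>\n",
	"\r", "<br>\n",
	"\n", "<br>\n",
)

// toBR is a hypothetical helper that replaces each line break with "<br>\n".
func toBR(s string) string {
	return lineBreaks.Replace(s)
}

func main() {
	// The "\r\n" pair yields a single "<br>\n", not two.
	fmt.Printf("%q\n", toBR("a\r\nb\rc\nd")) // prints "a<br>\nb<br>\nc<br>\nd"
}
```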
Here's what the regexp solution from Wiktor's answer looks like:
var re = regexp.MustCompile(`\r\n|[\r\n\v\f\x{0085}\x{2028}\x{2029}]`)

func replaceRegexp(s string) string {
	return re.ReplaceAllString(s, "<br>\n")
}
The strings.Replacer implementation is actually quite fast. Here's a simple benchmark comparing it to the above pre-compiled regexp solution:
const input = "1st\nsecond\r\nthird\r4th\u0085fifth\u2028sixth"

func BenchmarkReplacer(b *testing.B) {
	for i := 0; i < b.N; i++ {
		replaceReplacer(input)
	}
}

func BenchmarkRegexp(b *testing.B) {
	for i := 0; i < b.N; i++ {
		replaceRegexp(input)
	}
}
And the benchmark results:
BenchmarkReplacer-4 3000000 495 ns/op
BenchmarkRegexp-4 500000 2787 ns/op
For our test input, strings.Replacer was more than 5 times faster.
There's another advantage. In the examples above, both solutions return the result as a new string value, which requires allocating a new string. If we need to write the result to an io.Writer (e.g. we're creating an HTTP response or writing the result to a file), strings.Replacer lets us avoid creating that string: its Replacer.WriteString() method takes an io.Writer and writes the result to it directly, without allocating and returning it as a string. This can further increase the performance gain compared to the regexp solution.
Upvotes: 5
Reputation: 626896
You may "decode" the \R pattern as
U+000DU+000A|[U+000AU+000BU+000CU+000DU+0085U+2028U+2029]
See the Java regex docs explaining the \R shorthand:
Linebreak matcher \R Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
In Go, you may use the following:
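// Compile once at package level rather than on every call.
var lbrRe = regexp.MustCompile(`\x{000D}\x{000A}|[\x{000A}\x{000B}\x{000C}\x{000D}\x{0085}\x{2028}\x{2029}]`)

// removeLBR removes every Unicode line break sequence from text.
func removeLBR(text string) string {
	return lbrRe.ReplaceAllString(text, "")
}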
Here is a Go demo.
Some of the Unicode codes can be replaced with regex escape sequences supported by Go regexp:
re := regexp.MustCompile(`\r\n|[\r\n\v\f\x{0085}\x{2028}\x{2029}]`)
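To tie this back to the PHP snippet in the question, here is a small sketch that uses the shorter pattern to normalize every line break to "\n". The names `lineBreakRE` and `normalizeNewlines` are assumptions made for this example. Because Go's regexp prefers the leftmost alternative, putting `\r\n` first means a CRLF pair is replaced once, not twice.

```go
package main

import (
	"fmt"
	"regexp"
)

// lineBreakRE is a hypothetical name; the pattern is the one shown above.
// It is compiled once at package level so calls do not recompile it.
var lineBreakRE = regexp.MustCompile(`\r\n|[\r\n\v\f\x{0085}\x{2028}\x{2029}]`)

// normalizeNewlines is a hypothetical helper that rewrites every Unicode
// line break sequence as a single "\n".
func normalizeNewlines(s string) string {
	return lineBreakRE.ReplaceAllString(s, "\n")
}

func main() {
	fmt.Printf("%q\n", normalizeNewlines("a\r\nb\u2028c\u0085d\re")) // prints "a\nb\nc\nd\ne"
}
```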
Upvotes: 5