Reputation: 103
This Go blog post on strings (https://blog.golang.org/strings) contains a line that states "...string values can contain arbitrary bytes; as we showed in this one, string literals always contain UTF-8 text as long as they have no byte-level escapes."
What's the difference between a string value and a string literal?
Upvotes: 3
Views: 2487
Reputation: 5379
This distinction rests on the definitions in the Go language specification of a string literal and a string value:
String literal: A string literal represents a string constant obtained from concatenating a sequence of characters.
String value: A string value is a (possibly empty) sequence of bytes.
and the requirement that source code is represented in source files encoded as UTF-8 text:
Source code representation: Source code is Unicode text encoded in UTF-8.
Most crucially, note that a string literal is defined to be a constant, so its value is defined entirely at compile time.
Consider the set of valid Go source files to be those files which adhere to the language spec. By definition, such source files must be encoded as valid UTF-8. Because a string literal's value is determined at compile time from the contents of the source file, the text of any string literal, as written in the source, must be valid UTF-8.
If we ignore byte-level escape sequences, we can see further that the compiled output of the string literal in the resulting binary must also contain valid UTF-8, as it is compiled from a UTF-8 value in the source file.
However, string values are defined to be sequences of bytes,[1] so there is no requirement that they themselves contain valid UTF-8. Non-UTF-8 bytes may arise from values the program receives from sources other than string literals, such as data received over sockets or local pipes or read from a file, or from string literals which contain byte-level escape sequences.
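As a minimal sketch of the point above, the following program builds a string value at run time from raw bytes. The helper name fromBytes and the hard-coded input are assumptions for illustration; in a real program the bytes would come from a socket, pipe, or file.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// fromBytes builds a string value at run time from raw bytes. It stands in
// for data read from a socket, pipe, or file (hypothetical helper).
func fromBytes(b []byte) string {
	return string(b)
}

func main() {
	// Assumed input: two bytes that do not form valid UTF-8.
	s := fromBytes([]byte{0xb2, 0xbd})

	fmt.Println(len(s))              // 2
	fmt.Printf("% x\n", s)           // b2 bd
	fmt.Println(utf8.ValidString(s)) // false
}
```

The string holds the two bytes unchanged; nothing in the conversion checks or repairs the encoding.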
[1] As the blog post says, this sequence of bytes may be considered to be a byte slice ([]byte) for most purposes. However, this is not strictly correct, as the underlying implementation does not use a slice; strings are immutable and do not need to track their capacity separately. Calling cap(str) where str is of type string is an error.
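The difference between a string and a byte slice can be illustrated with a short sketch. The helper name mutateCopy is an assumption for illustration; the two commented-out lines are the ones the compiler rejects.

```go
package main

import "fmt"

// mutateCopy converts s to a byte slice, changes the first byte of that
// copy, and returns both so the caller can see that s is untouched.
func mutateCopy(s string) (string, []byte) {
	b := []byte(s) // the conversion yields an independent copy of the bytes
	if len(b) > 0 {
		b[0] = 'X'
	}
	return s, b
}

func main() {
	s, b := mutateCopy("abc")
	fmt.Println(s, string(b)) // abc Xbc

	// len works on both types; cap only works on the slice.
	fmt.Println(len(s), len(b), cap(b) >= len(b))

	// s[0] = 'X' // compile error: strings are immutable
	// _ = cap(s) // compile error: cap is not defined for strings
}
```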
Byte-level escape sequences provide a mechanism for encoding non-UTF-8 values using a UTF-8 representation. Such sequences allow arbitrary bytes to be represented in a UTF-8 format to produce valid Go source files and satisfy the language spec.
For example, the byte sequence b2 bd (where bytes are here represented as space-delimited two-digit hexadecimal numbers) is not valid UTF-8. Checking this byte sequence with a UTF-8 validator reports it as invalid:
// Requires the imports "fmt" and "unicode/utf8".
var seq = "\xb2\xbd"
fmt.Println("Is 'seq' valid UTF-8 text?", utf8.ValidString(seq))
$ go run ./main.go
Is 'seq' valid UTF-8 text? false
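To see what a decoder does with these bytes, rather than just whether they validate, here is a small sketch using utf8.DecodeRuneInString from the standard library. The helper name decodeAll is an assumption for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll decodes s rune by rune and returns each decoded rune together
// with the number of bytes it consumed.
func decodeAll(s string) (runes []rune, sizes []int) {
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		runes = append(runes, r)
		sizes = append(sizes, size)
		s = s[size:]
	}
	return runes, sizes
}

func main() {
	runes, sizes := decodeAll("\xb2\xbd")
	for i, r := range runes {
		// An invalid byte decodes to utf8.RuneError (U+FFFD) with size 1.
		fmt.Printf("%U size=%d isRuneError=%t\n", r, sizes[i], r == utf8.RuneError)
	}
}
```

Go's decoder does not return an error value here; each undecodable byte is reported as the replacement rune U+FFFD with a width of one byte.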
Consequently, while such a byte sequence can be stored in a string in Go, it is not possible to express directly in a Go source file. Any source file which does so is invalid Go and the lexer will reject it (see below).
A backslash-escape sequence provides a mechanism for decomposing this byte string into a valid UTF-8 representation which satisfies the toolchain. The interpreted Go string literal "\xb2\xbd" represents the string using a sequence of ASCII characters, which can be expressed in UTF-8. When compiled, this sequence is parsed to produce a string in the compiled output containing the byte sequence b2 bd.
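The gap between the characters written in the source and the bytes stored in the value can be sketched by comparing a raw string literal, which keeps the backslashes as written, with the interpreted literal. The helper name hexBytes is an assumption for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// hexBytes renders the bytes of s as space-delimited two-digit hex numbers.
func hexBytes(s string) string {
	return fmt.Sprintf("% x", s)
}

func main() {
	raw := `\xb2\xbd`         // raw literal: the eight ASCII characters as written
	interpreted := "\xb2\xbd" // interpreted literal: each escape becomes one byte

	fmt.Println(hexBytes(raw), utf8.ValidString(raw))
	// 5c 78 62 32 5c 78 62 64 true
	fmt.Println(hexBytes(interpreted), utf8.ValidString(interpreted))
	// b2 bd false
}
```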
I'll provide a working example to demonstrate this more concretely. As we will be generating invalid source files, I cannot readily use the Go playground for this; I used the go toolchain on my machine. The toolchain is go version go1.11 darwin/amd64.
Consider the following simple Go program where values may be specified for the string myStr. For convenience, we replace the whitespace (the byte sequence 20 20 20 after the string my value:) with the desired bytes. This allows us to easily find the compiled output when disassembling binaries later.
package main

import "fmt"

func main() {
	myStr := "my value:   "
	fmt.Println(myStr)
}
I used a hex editor to insert the invalid UTF-8 byte sequence b2 bd into the final bytes of the whitespace in myStr. Representation below; other lines elided for brevity:
00000030: 203a 3d20 22b2 bd20 220a 0966 6d74 2e50 := ".. "..fmt.P
Attempting to build this file results in the error:
$ go build ./test.go
# command-line-arguments
./test.go:6:12: invalid UTF-8 encoding
If we open the file in an editor, it interprets the bytes in its own way (likely to be platform-specific) and, on save, emits valid UTF-8. On this OS X-based system, using vim, the sequence is interpreted as Unicode U+00B2 U+00BD, or ²½. This satisfies the compiler, but the source file and hence the compiled binary now contain a different byte sequence from the one originally intended. The code dump below exhibits the sequence c2 b2 c2 bd, as does a hex dump of the source file. (See the final two bytes of line 1 and first two bytes of line 2.)
$ gobjdump -s -j __TEXT.__rodata test | grep -A1 "my value"
10c4c50 63746f72 796d7920 76616c75 653ac2b2 ctorymy value:..
10c4c60 c2bd206e 696c2065 6c656d20 74797065 .. nil elem type
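The change from b2 bd to c2 b2 c2 bd can be reproduced with a short sketch. It assumes the editor treated each byte as a Latin-1 code point and re-encoded it as UTF-8, which matches the result observed here but is not confirmed as what vim does internally; the helper name latin1ToUTF8 is hypothetical.

```go
package main

import "fmt"

// latin1ToUTF8 treats each input byte as a Latin-1 code point and encodes
// it as UTF-8. This is an assumed model of the editor's behaviour.
func latin1ToUTF8(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes) // converting []rune to string encodes as UTF-8
}

func main() {
	s := latin1ToUTF8([]byte{0xb2, 0xbd})
	fmt.Printf("%s % x\n", s, s) // ²½ c2 b2 c2 bd
}
```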
To recover the original byte sequence, we can change the string definition thus:
myStr := "my value: \xb2\xbd"
Dumping the source file produced by the editor yields a valid UTF-8 sequence 5c 78 62 32 5c 78 62 64, the UTF-8 encoding of the ASCII characters \xb2\xbd:
00000030: 203a 3d20 226d 7920 7661 6c75 653a 205c := "my value: \
00000040: 7862 325c 7862 6422 0a09 666d 742e 5072 xb2\xbd"..fmt.Pr
Yet the binary produced by building this source file demonstrates the compiler has transformed this string literal to contain the desired b2 bd sequence (final two bytes of the fourth column):
10c48b0 6d792076 616c7565 3a20b2bd 6e6f7420 my value: ..not
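The same transformation can be approximated at run time with strconv.Unquote, which interprets a quoted string using Go string literal syntax. This is only a sketch of the idea, not the compiler's actual code path; the helper name interpret is an assumption for illustration.

```go
package main

import (
	"fmt"
	"strconv"
)

// interpret applies Go's interpreted-string-literal rules to the quoted
// source text src at run time, roughly mirroring what happens at build time.
func interpret(src string) (string, error) {
	return strconv.Unquote(src)
}

func main() {
	src := `"my value: \xb2\xbd"` // the literal as it appears in the source file
	s, err := interpret(src)
	if err != nil {
		fmt.Println("unquote failed:", err)
		return
	}
	fmt.Printf("source: % x\n", src) // ends in 5c 78 62 32 5c 78 62 64 22
	fmt.Printf("value:  % x\n", s)   // ends in b2 bd
}
```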
Upvotes: 9
Reputation: 337
In general, there is no difference between a string value and a string literal. A string literal is just one of the ways to initialize a new string variable (the string as is).
Upvotes: -5