Reputation: 90736
Does Go do any conversion or otherwise try to interpret the bytes when converting a byte slice to a string? I just tried with a byte slice containing a null byte, and it looks like the string keeps the bytes as they are.
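var test []byte
test = append(test, 'a')
test = append(test, 'b')
test = append(test, 0)
test = append(test, 'd')
fmt.Println(test[2] == 0) // OK
s := string(test)         // convert the byte slice to a string
fmt.Println(s[2] == 0)    // true: the null byte survives the conversion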
But what about strings with invalid Unicode code points or invalid UTF-8 encoding? Could the conversion fail or the data be corrupted?
Upvotes: 4
Views: 7150
Reputation: 166529
The Go Programming Language Specification
A string type represents the set of string values. A string value is a (possibly empty) sequence of bytes.
Conversions to and from a string type
Converting a slice of bytes to a string type yields a string whose successive bytes are the elements of the slice.
string([]byte{'h', 'e', 'l', 'l', '\xc3', '\xb8'})   // "hellø"
string([]byte{})                                     // ""
string([]byte(nil))                                  // ""

type MyBytes []byte
string(MyBytes{'h', 'e', 'l', 'l', '\xc3', '\xb8'})  // "hellø"
Converting a value of a string type to a slice of bytes type yields a slice whose successive elements are the bytes of the string.
[]byte("hellø")   // []byte{'h', 'e', 'l', 'l', '\xc3', '\xb8'}
[]byte("")        // []byte{}

MyBytes("hellø")  // []byte{'h', 'e', 'l', 'l', '\xc3', '\xb8'}
A string value is a (possibly empty) sequence of bytes. A string value may or may not represent Unicode characters encoded in UTF-8. There is no interpretation of the bytes during the conversion from byte slice to string, nor from string to byte slice. Therefore, the bytes will not be changed and the conversions will not fail.
Upvotes: 10
Reputation: 23537
No, the casting can't fail. Here's an example showing this (run in the Go Playground):
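package main

import "fmt"

func main() {
	b := []byte{0x80} // a lone continuation byte: invalid UTF-8
	s := string(b)
	fmt.Println(s)         // the raw byte 0x80 is written unchanged
	fmt.Println([]byte(s)) // converting back yields the same byte
	for _, c := range s {
		fmt.Println(c) // range decodes the invalid byte as U+FFFD
	}
}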
This prints:
�
[128]
65533
Note that ranging over invalid UTF-8 is well defined according to the Go spec:
For a string value, the "range" clause iterates over the Unicode code points in the string starting at byte index 0. On successive iterations, the index value will be the index of the first byte of successive UTF-8-encoded code points in the string, and the second value, of type rune, will be the value of the corresponding code point. If the iteration encounters an invalid UTF-8 sequence, the second value will be 0xFFFD, the Unicode replacement character, and the next iteration will advance a single byte in the string.
Upvotes: 5