Reputation:
I have MariaDB table that has index VARCHAR(10) NOT NULL COLLATE 'utf8mb3_general_ci'
type. I have a string in Go that I cut to 10 characters, if it is longer, to insert into/update this table. I cut the string as:
if len(value) > 10 {
value = value[:10]
}
Right now I have encountered an issue with string that ends with š
character. MariaDB throws error: Error 1366: Incorrect string value: '\\xC5'
. Looking up unicode tables, this character is represented as \xc5\xa1
which makes me believe the cutting of the string somehow makes the string indigestible for the database?
I would like to avoid handling utf8/unicode in my code because that would require going through all database methods and massaging all strings. And I do not believe this is necessary since i have never needed it before. So I think the issue lies somewhere else but am not sure where.
I tried to switch the collation to utf8mb4_general_ci
but that did not help either.
Interestingly, if I edit the column directly with HeidiSQL, the string saves just fine.Which leads me to believe this might be a driver issue. I am using the github.com/go-sql-driver/mysql
, as always. So I would not expect issues but, who knows...
Upvotes: 1
Views: 1500
Reputation: 45140
which makes me believe the cutting of the string somehow makes the string indigestible for the database?
Cutting strings by sub-slicing as value[:10]
(and measuring length with len
for that matter) is always a mistake if your program has any chance of dealing with multi-byte characters. That's because indexing a string operates on its bytes, which may or may not be part of multi-byte encoding.
As you found out, the character š
is encoded in UTF-8 as \xc5\xa1
. If these two bytes appear in your value
string right at index 9
and 10
your index expression [:10]
corrupts the data.
The character sets utf8mb3
and utf8mb4
only restrict the range of admitted UTF-8 to respectively 3-byte and 4-byte characters, but \xc5
is not valid UTF-8 to begin with, so it gets rejected either way.
In MariaDB a column with data type VARCHAR(N)
counts characters (as specified by the collation). You want to cut your value
string at the tenth character, instead of at the tenth byte.
I would like to avoid handling utf8/unicode in my code
You are already admitting UTF-8 by declaring the MariaDB collation as utf8mb3
. It's only logical that you properly handle input data in your code as UTF-8. To cut at the n-th character (or rune
, which in Go represents a Unicode code point) you can use something like:
// count the runes
if utf8.RuneCountInString(value) > 10 {
// convert string to rune slice
chars := []rune(value)
// index the rune slice and convert back to string
value = string(chars[:10])
}
This won't corrupt UTF-8 encoding, however keep in mind it does more allocs and doesn't account for composed characters, e.g. when the joiner 200D
is involved.
Upvotes: 1