user17644273
user17644273

Reputation:

Error 1366: Incorrect string value when inserting strings into MariaDB

I have MariaDB table that has index VARCHAR(10) NOT NULL COLLATE 'utf8mb3_general_ci' type. I have a string in Go that I cut to 10 characters, if it is longer, to insert into/update this table. I cut the string as:

if len(value) > 10 {
  value = value[:10]
}

Right now I have encountered an issue with string that ends with š character. MariaDB throws error: Error 1366: Incorrect string value: '\\xC5'. Looking up unicode tables, this character is represented as \xc5\xa1 which makes me believe the cutting of the string somehow makes the string indigestible for the database?

I would like to avoid handling utf8/unicode in my code because that would require going through all database methods and massaging all strings. And I do not believe this is necessary since i have never needed it before. So I think the issue lies somewhere else but am not sure where.

I tried to switch the collation to utf8mb4_general_ci but that did not help either.

Interestingly, if I edit the column directly with HeidiSQL, the string saves just fine.Which leads me to believe this might be a driver issue. I am using the github.com/go-sql-driver/mysql, as always. So I would not expect issues but, who knows...

Upvotes: 1

Views: 1500

Answers (1)

blackgreen
blackgreen

Reputation: 45140

which makes me believe the cutting of the string somehow makes the string indigestible for the database?

Cutting strings by sub-slicing as value[:10] (and measuring length with len for that matter) is always a mistake if your program has any chance of dealing with multi-byte characters. That's because indexing a string operates on its bytes, which may or may not be part of multi-byte encoding.

As you found out, the character š is encoded in UTF-8 as \xc5\xa1. If these two bytes appear in your value string right at index 9 and 10 your index expression [:10] corrupts the data.

The character sets utf8mb3 and utf8mb4 only restrict the range of admitted UTF-8 to respectively 3-byte and 4-byte characters, but \xc5 is not valid UTF-8 to begin with, so it gets rejected either way.

In MariaDB a column with data type VARCHAR(N) counts characters (as specified by the collation). You want to cut your value string at the tenth character, instead of at the tenth byte.

I would like to avoid handling utf8/unicode in my code

You are already admitting UTF-8 by declaring the MariaDB collation as utf8mb3. It's only logical that you properly handle input data in your code as UTF-8. To cut at the n-th character (or rune, which in Go represents a Unicode code point) you can use something like:

// count the runes
if utf8.RuneCountInString(value) > 10 {
  // convert string to rune slice
  chars := []rune(value)
  // index the rune slice and convert back to string
  value = string(chars[:10])
}

This won't corrupt UTF-8 encoding, however keep in mind it does more allocs and doesn't account for composed characters, e.g. when the joiner 200D is involved.

Upvotes: 1

Related Questions