Reputation: 43
I am currently migrating data from MySQL 5.6.41 on Windows, to MySQL 8.0.21 on Windows. Overall, a pretty smooth migration, with a couple of very frustrating hiccups. There's one table that looks like this:
CREATE TABLE `domains` (
`intDomainID` int(11) NOT NULL AUTO_INCREMENT,
`txtDomain` varchar(100) NOT NULL,
`dtDateTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`blnTor` int(1) NOT NULL DEFAULT '0',
`txtTLD` varchar(100) NOT NULL,
PRIMARY KEY (`intDomainID`),
UNIQUE KEY `txtDomain` (`txtDomain`)
ENGINE=InnoDB AUTO_INCREMENT=10127897 DEFAULT CHARSET=utf8mb4;
The CREATE SCHEMA is complete, and was created by Workbench's "Copy to Clipboard" --> "Create Schema" function.
When I used the built in Workbench export/import, the import always failed with "Duplicate value in txtDomain" (paraphrasing here) error, which is weird because the original table has a UNIQUE KEY constraint on that field, so there cannot be duplicates, and I confirmed, the values it was finding as duplicates were NOT duplicates in the original database.
I then dumped the table using SELECT ... INTO OUTFILE
, moved the file over to the new server, and did a LOAD DATE INFILE
. This also failed with the same "Duplicate value in txtDomain" error.
I then removed the UNIQUE KEY constraint, and redid the LOAD DATE INFILE
. This worked, the data is there. However, I cannot add back the UNIQUE KEY constraint due to "duplicates". I investigated and found this:
Query result on MySQL 5.6.41:
Query result on MySQL 8.0.21:
Now, what is going on? The table definition, the database, table and field charset/collations are identical. I need that UNIQUE KEY constraint back...
Why is http://deepdot35wvmeyd5.onion:80
== http://dee-p---dot35w-vmeyd5.onion:80
??
In case it helps, my export command:
SELECT * INTO OUTFILE 'S:\\temp\\domains.txt'
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM domains;
My import command:
LOAD DATA INFILE 'E:\\DB Backup\\ServerMigration\\domains.txt'
INTO TABLE domains
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';
COLLATIONS: Collations Old Server: utf8_general_ci [I don't remember touching this value] New Server: utf8mb4_0900_ai_ci [I didn't touch this value] DB old/new are the same: utf8mb4_0900_ai_ci Table old/new are the same: utf8mb4_0900_ai_ci
This is how the raw TXT file looks like on the file-system:
Note, if I paste in one of the URLs from the screenshot into here, it magically turns into the "correct" value, without the dashes: i.e.: http://deepdot35wvmeyd5.onion:80
Note2: Using Notepad++, if I convert a regular "dash" to HEX I get "2D". However, if I convert one from the URLs that's causing trouble, I get HEX "C2AD". So it seems that I'm dealing with a weird unicode character and not a dash?
Note3: If anyone wants a small sample file, here it is: https://www.dropbox.com/s/1ssbl95t2jgn2xy/domains_small.zip
Upvotes: 4
Views: 437
Reputation: 20550
The character in question is U+00AD "SOFT HYPHEN" - a non-printable character that is used to signal a possible hyphenation point inside a word.
It seems that the COLLATION used handles these characters differently on the new setup (MySQL 8.0 with default collation settings) than it did on the old setup (MySQL 5.7 with default collation settings):
These nonprintable characters are now ignored in a comparison.
You can test the difference with this simple fiddle. The comparison is "0" in 5.6, but "1" in MySQL 8.0 -> https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=a9bd0bf7de815dc14a886c5069bd1a0f
Note that the SQL fiddle also uses a default collation configuration when it's not specified explicitly.
You might fix that by setting a binary UTF-8 collation for the txtDomain column, which is actually what you want for technical strings anyway:
CREATE TABLE `domains` (
`intDomainID` int(11) NOT NULL AUTO_INCREMENT,
`txtDomain` varchar(100) NOT NULL
CHARACTER SET utf8mb4
COLLATE utf8mb4_binary_ci,
`dtDateTime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
`blnTor` int(1) NOT NULL DEFAULT '0',
`txtTLD` varchar(100) NOT NULL,
PRIMARY KEY (`intDomainID`),
UNIQUE KEY `txtDomain` (`txtDomain`)
) ENGINE=InnoDB AUTO_INCREMENT=10127897 DEFAULT CHARSET=utf8mb4;
UPDATE: As it turns out, the COLLATION must have been different between the old (5.6) and new (8.0) setup, as utf8mb4_0900_ai_ci
was introduced with MySQL 8.0. The old collation must have been utf8mb4_general_ci
, which when applied shows the desired behaviour in MySQL 8.0, too.
But still, you should use binary collation for technical strings like URLs anyways.
Upvotes: 2