jhchnc
jhchnc

Reputation: 459

Very odd characters in a mysql dump -- what to do?

I've got weird data mangling my data migration. These weird characters are embedded as-is in the actual mysql dump file:


北京东方å›悦大酒店<br />\n<br />\n“The impetus

I've been given mysql data dumps with those kinds of chars in them. I'm importing the data into Drupal, by first recreating the mysql tables, and then querying against them using Drupal's Migrate module.

Code looks like this:

DROP TABLE IF EXISTS `news`;
SET @saved_cs_client     = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE `news` (
  `id` int(11) NOT NULL auto_increment,
  `uid` int(11) NOT NULL,
  `pid` int(11) default NULL,
  `puid` int(11) default NULL,
  `headline` varchar(255) NOT NULL,
  `teaser` varchar(500) NOT NULL,
  `status` char(1) default NULL,
  `date` datetime NOT NULL,
  `url` varchar(255) default NULL,
  `url_title` varchar(255) default NULL,
  `body` text,
  `caption` varchar(255) default NULL,
  `gid` int(11) default NULL,
  `feature` text,
  `related` varchar(255) default NULL,
  `change1_time` int(11) default NULL,
  `change2_time` int(11) default NULL,
  `change1_user` varchar(255) default NULL,
  `change2_user` varchar(255) default NULL,
  `expires` datetime default NULL,
  `rank` char(1) default NULL,
  PRIMARY KEY  (`id`),
  KEY `uid` (`uid`),
  KEY `status` (`status`),
  KEY `expires` (`expires`),
  KEY `rank` (`rank`),
  KEY `puid` (`puid`),
  FULLTEXT KEY `headline` (`headline`,`teaser`,`body`)
) ENGINE=MyISAM AUTO_INCREMENT=6976 DEFAULT CHARSET=utf8;
SET character_set_client = @saved_cs_client;

Fastest solution is the winner here -- I'm on a tight deadline, and really suffering over here! I've tried a search and replace solution, but there appear to be too many different types of weird data. I can orchestrate a new data dump, if I know what to tell them (how to do the data dump).

Thanks, John

Upvotes: 1

Views: 1557

Answers (1)

Ilmari Karonen
Ilmari Karonen

Reputation: 50338

This isn't a direct answer to your question, but I played a bit with the mojibake you quoted in your post. It looks like it was originally Chinese text in UTF-8 encoding, which was interpreted as Latin text in Windows-1252 encoding, re-encoded in UTF-8 and again interpreted as Windows-1252 (and finally once more encoded as UTF-8 when you posted it here). So it's not just mojibake, it's double mojibake.

Also, at some point, a byte was lost from the middle of the string (probably because it was one of the undefined code points in Windows-1252), mangling one of the original characters. Running the text through the encoding chain in reverse (encode as Windows-1252, decode as UTF-8, repeat), I get the output:

北京东方�悦大酒店<br />\n<br />\n“The impetus

where the replacement character stands for the mangled character.

Upvotes: 5

Related Questions