Reputation: 4179
We are trying to migrate database content (with a PHP script).
The content was copied into a CMS and then written to the database. The copied content could come from any website and be in any character encoding (e.g. ISO-...-14).
The PHP CMS is UTF-8, so a character pasted into a textbox was converted to UTF-8 when it was POSTed, but it was then written to the database as Latin-1 (MSSQL; the database charset and the query charset are both Latin-1).
We are desperately trying to work out how this could be reversed in PHP, or whether it is even possible, so that the characters end up as proper UTF-8.
If we can work out the logic, we can write an extension in C++ if PHP can't handle it (which it probably can't, given mb_shite and iconv).
I keep getting lost in UTF-8's multi-byte character sequences, which can be up to 4 bytes long (i.e. 0-127 is... etc.).
Anybody got any ideas?
So far we have used PHP's ord() function to try to produce a Unicode/ASCII character reference for each character (I know ord() returns ASCII, but it prints numbers over 128, which I thought was weird if it is just meant to be ASCII, or maybe it repeats itself).
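For what it is worth, ord() returns the value of a single byte (0-255), not a character number, which is why values over 128 show up. A minimal sketch of how the bytes of one UTF-8 sequence combine into a code point follows; the function name utf8FirstCodePoint is made up for illustration, and it does not reject overlong forms or surrogates.

```php
<?php
// Hypothetical helper: decode the first UTF-8 sequence in $s to its code point.
// Lead byte patterns: 0xxxxxxx (1 byte), 110xxxxx (2), 1110xxxx (3), 11110xxx (4).
function utf8FirstCodePoint(string $s): int
{
    if ($s === '') {
        throw new InvalidArgumentException('Empty string');
    }
    $b0 = ord($s[0]);
    if ($b0 < 0x80) {
        return $b0; // plain ASCII, 0-127
    }
    if (($b0 & 0xE0) === 0xC0) {
        $len = 2; $cp = $b0 & 0x1F;
    } elseif (($b0 & 0xF0) === 0xE0) {
        $len = 3; $cp = $b0 & 0x0F;
    } elseif (($b0 & 0xF8) === 0xF0) {
        $len = 4; $cp = $b0 & 0x07;
    } else {
        throw new InvalidArgumentException('Invalid UTF-8 lead byte');
    }
    if (strlen($s) < $len) {
        throw new InvalidArgumentException('Truncated UTF-8 sequence');
    }
    // Each continuation byte is 10xxxxxx and contributes its low 6 bits.
    for ($i = 1; $i < $len; $i++) {
        $b = ord($s[$i]);
        if (($b & 0xC0) !== 0x80) {
            throw new InvalidArgumentException('Invalid continuation byte');
        }
        $cp = ($cp << 6) | ($b & 0x3F);
    }
    return $cp;
}

// ord() sees bytes: "é" in UTF-8 is the two bytes 0xC3 0xA9.
$byteValues = array_map('ord', str_split("\xC3\xA9")); // [195, 169]
$codePoint  = utf8FirstCodePoint("\xC3\xA9");          // 233, i.e. U+00E9
```

In Latin-1 the same character is the single byte 0xE9, so ord() would print 233 for it directly.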
My thinking is that Latin-1 will struggle to convert back to UTF-8 and will produce black diamonds, because Latin-1 (ISO-...-1) is a single-byte character stream.
Upvotes: 0
Views: 5402
Reputation: 31641
If latin1 is an 8-bit clean encoding for your database (it is in MySQL; I don't know about MSSQL), then you don't need to do anything to reconstruct the UTF-8 string. When you pull it out of your database into PHP, you will get back the same bytes you put in, i.e. UTF-8.
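One way to check this on real rows is to test whether the bytes that come back are still well-formed UTF-8. The sketch below is only an illustration: isValidUtf8 is a made-up helper name, and $written / $readBack stand in for a value before and after a database round trip that is not shown here.

```php
<?php
// Hypothetical helper: true when $bytes is a well-formed UTF-8 sequence.
// With the /u modifier PCRE validates the subject string; on invalid
// UTF-8, preg_match() returns false instead of 1.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

// Stand-ins for "what the CMS wrote" and "what the database returned".
$written  = "caf\xC3\xA9"; // "café" encoded as UTF-8
$readBack = $written;      // assumption: an 8-bit clean column returns the same bytes

// PHP compares strings byte by byte, so === is enough to confirm the round trip.
$sameBytes = ($written === $readBack);

// A hex dump is handy for eyeballing suspicious rows: 636166c3a9
$hexDump = bin2hex($readBack);
```

If isValidUtf8() holds for the rows that contain non-ASCII bytes, that is a good sign the column kept the bytes intact.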
If latin1 is not an 8-bit-clean encoding for your database, then your strings are irretrievably broken. This means any characters which the database considered invalid were either dropped or replaced the moment you wrote your UTF-8 string to the database. There isn't any way to recover from this.
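To see why replacement cannot be undone, consider a toy model of a lossy column. Everything here is an assumption for illustration: lossyStore is a made-up function, and the choice to replace bytes 0x80-0x9F with "?" is arbitrary, not a description of what MSSQL actually does.

```php
<?php
// Hypothetical lossy store: replaces every byte in 0x80-0x9F with "?".
// The byte range is an assumption picked only to demonstrate the effect.
function lossyStore(string $bytes): string
{
    // No /u modifier, so the pattern matches raw bytes.
    return preg_replace('/[\x80-\x9F]/', '?', $bytes);
}

$euro  = "\xE2\x82\xAC"; // U+20AC in UTF-8
$other = "\xE2\x83\xAC"; // U+20EC in UTF-8, differs only in the middle byte

$storedEuro  = lossyStore($euro);
$storedOther = lossyStore($other);

// Two different characters collapse to the same stored bytes,
// so nothing can tell afterwards which one was originally written.
$collision = ($storedEuro === $storedOther);

// The stored value is also no longer well-formed UTF-8.
$stillValid = (preg_match('//u', $storedEuro) === 1);
```

Once two inputs map to the same stored value, no amount of re-encoding in PHP or C++ can recover the original.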
Upvotes: 3