HartleySan
HartleySan

Reputation: 7810

How do I use PDO in PHP to get data out of a SQL Server DB and properly encode it in a UTF-8-encoded MySQL DB?

I'm using PDO in PHP to connect to a SQL Server DB and pull data out to essentially replicate the SQL Server DB in a MySQL DB.

After some searching, I found that the collation of the SQL Server DB is SQL_Latin1_General_CP1_CI_AS, which equates to the ISO-8859-1 encoding. As such, when I connect to the SQL Server DB with PDO, I use the following:

$dsn = 'dblib:host=aws-rds-endpoint.rds.amazonaws.com:1433;dbname=db-name;charset=iso_1';

The iso_1 on the end seems to properly pull everything out of the SQL Server DB without any garbled text.

The MySQL DB is UTF-8 though, so I run the following on each string value to convert everything from ISO-8859-1 to UTF-8:

$val = isset($val) ? mb_convert_encoding($val, 'UTF-8', 'ISO-8859-1') : null;

However, when I insert the data into the MySQL DB (for which I've done the following for the DB link in PHP: mysqli_set_charset($link, 'utf8');), I am getting garbled text.
The following are some examples of garbled characters:

’ => PU2
“ => STS
” => CCH

The PU2, etc. show up in the MySQL DB with black boxes around them. (Please note that I'm using MySQL Workbench to connect to the MySQL DB.)

Basically, I don't know why it's garbling these characters or how to fix it. If I don't run the mb_convert_encoding function above, then when I insert the data into the MySQL DB, all the strings with characters like the above instantly get truncated at the first instance of a weird character and the truncated string is inserted into the MySQL DB.

As a final note, the fields that I've noticed this for in the SQL Server DB are all nvarchar with a CHARACTER_MAXIMUM_LENGTH of -1 and a CHARACTER_SET_NAME of UNICODE, not iso_1, in case that matters.

Upvotes: 0

Views: 696

Answers (1)

Rick James
Rick James

Reputation: 142278

Do not use any conversion routines, it will only add to the confusion.

utf8 Hex latin1 hex  Unicode     Char    Name
E28099   92          8217=x2019  [’]     RIGHT SINGLE QUOTATION MARK
E2809C   93          8220=x201C  [“]     LEFT DOUBLE QUOTATION MARK
E2809D   94          8221=x201D  [”]     RIGHT DOUBLE QUOTATION MARK

These are valid UTF-8 (outside MySQL) characters, and valid MySQL characters in both ut8 and utf8mb4 CHARACTER SETs.

The 8-bit latin1-utf8 mappings are here.

Most programming languages will not treat them as the same as apostrophe (') and double quote (").

This is good:

mysqli_set_charset($link, 'utf8');

But, do you also declare the column's CHARACTER SET? Let's see SHOW CREATE TABLE.

I don't recognize PU2, etc.

If you correctly tell MySQL what the client's encoding is (latin1, utf8, etc), then MySQL will correctly convert to the CHARACTER SET of the column you are putting the data in (assuming a conversion is possible). From the Comments, it sounds like the client used latin1 (eg, 92, not E8099).

Upvotes: 1

Related Questions