dotty

Reputation: 41473

'Identical' Strings are different

I have 2 strings "CHILDREN’S".

One of them is saved in a MySQL database (it's a page title from WordPress). The other is a copied-and-pasted version of the string from the database.

When I run var_dump on the two strings (var_dump("CHILDREN’S"); var_dump($string);), the copy-pasted one is string(12) "CHILDREN’S" and the one from the database is string(16) "CHILDREN’S". I'm assuming that this is a UTF-8 issue.

Can someone shed some light on why the identical-looking strings are in fact not identical?

Upvotes: 0

Views: 216

Answers (4)

LSerni

Reputation: 57418

"CHILDREN'S" is ten characters. To make it 12 bytes, the "'" must be a multi-byte UTF-8 character (the curly quote ’, U+2019, takes three bytes), and that's OK.

But I see no way to get 16 bytes unless the second quote is really stored as an HTML entity such as &#8217; or &rsquo;. I know of no seven-byte encoding of that character other than an HTML entity.

If it is so, then html_entity_decode could be your friend.
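A minimal sketch of that idea. It assumes the database copy holds the numeric entity &#8217; (the question doesn't show the raw bytes, so this is a guess) and that the pasted copy is UTF-8:

```php
<?php
// Assumption: the pasted string contains U+2019 (right single quotation mark)
// encoded as the three UTF-8 bytes E2 80 99.
$pasted = "CHILDREN\xE2\x80\x99S";

// Assumption: the database copy stores the quote as a numeric HTML entity.
$fromDb = 'CHILDREN&#8217;S';

var_dump(strlen($pasted)); // int(12)
var_dump(strlen($fromDb)); // int(16)

// Decode entities into UTF-8 characters before comparing.
$decoded = html_entity_decode($fromDb, ENT_QUOTES | ENT_HTML5, 'UTF-8');

var_dump($decoded === $pasted); // bool(true)
```

If the comparison still fails after decoding, the stored value is probably something other than a plain entity, and a byte-level dump is the next thing to look at.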

Upvotes: 1

JvdBerg

Reputation: 21856

To see how the strings really differ, you could print each string byte by byte in hex.

For example:

$s1 = 'CHILDREN\'S';

// Print every byte of the string as a hex value
for ($i = 0; $i < strlen($s1); $i++) {
    echo '0x' . bin2hex(substr($s1, $i, 1)) . ' ';
}

This gives 0x43 0x48 0x49 0x4c 0x44 0x52 0x45 0x4e 0x27 0x53 as a result. Try the same with the string from the database, and see where it differs.
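For comparison, here is a rough sketch of what such a dump might look like for two plausible forms of the apostrophe. The helper name hexBytes() is made up for this example, and both sample values are assumptions about what the strings could contain, not the asker's actual data:

```php
<?php
// Hypothetical helper: returns the bytes of $s as space-separated hex pairs.
function hexBytes(string $s): string
{
    return implode(' ', str_split(bin2hex($s), 2));
}

// Assumed candidates for the apostrophe; the real database value may differ.
$utf8   = "CHILDREN\xE2\x80\x99S"; // U+2019 as three UTF-8 bytes
$entity = 'CHILDREN&#8217;S';      // numeric HTML entity, seven ASCII bytes

echo hexBytes($utf8), "\n";
// 43 48 49 4c 44 52 45 4e e2 80 99 53
echo hexBytes($entity), "\n";
// 43 48 49 4c 44 52 45 4e 26 23 38 32 31 37 3b 53
```

The two dumps agree for the first eight bytes and diverge exactly where the apostrophe sits.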

Upvotes: 0

SDC

Reputation: 14222

I would guess it's actually stored as an HTML entity in one of the versions of the string.

If it's stored as &rsquo; rather than an actual character, then it is obviously a different string length.

Bear in mind also that PHP's strlen() function is not multi-byte safe: it counts bytes, not characters. If you've got Unicode characters in there, you should use the mb_strlen() function instead if you want an accurate character count. This accounts for why even your shorter count is 12 when the string actually contains only 10 characters.

The additional four characters in the other copy are probably due to an HTML entity.
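To illustrate the byte count versus the character count, a small sketch. It assumes the string is UTF-8 with U+2019 as its apostrophe and that the mbstring extension is available:

```php
<?php
// Assumption: the string is UTF-8 and uses U+2019 as its apostrophe.
$s = "CHILDREN\xE2\x80\x99S";

var_dump(strlen($s));             // int(12): bytes
var_dump(mb_strlen($s, 'UTF-8')); // int(10): characters
```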

Upvotes: 0

Nim

Reputation: 631

This could be an encoding problem, in which case you will want to check the database column's encoding and make sure it is what you expect.

Alternatively, you may have a couple of non-printable characters in the database string -- it could be that you copy/pasted some kind of nasty characters from your database tool.
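One way to check for stray invisible characters might look like the sketch below. The sample value and the list of code points are illustrative assumptions only; adjust the pattern to whatever you suspect is in your data:

```php
<?php
// Hypothetical example value: a title with a trailing zero-width space (U+200B).
$title = "CHILDREN'S\xE2\x80\x8B";

// Match ASCII control characters plus a few common invisible code points
// (no-break space, zero-width space, byte order mark). This list is
// illustrative, not exhaustive.
$pattern = '/[\x{00}-\x{1F}\x{7F}\x{A0}\x{200B}\x{FEFF}]/u';

preg_match_all($pattern, $title, $matches, PREG_OFFSET_CAPTURE);

// Report each hit with its byte offset and raw bytes in hex.
foreach ($matches[0] as [$char, $byteOffset]) {
    echo 'byte offset ', $byteOffset, ': ', bin2hex($char), "\n";
}
```

For the sample value this reports a single hit at byte offset 10 with the bytes e2808b.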

Upvotes: 0
