azz0r

Reputation: 3311

Replace Word Screw ups Via PHP

Our content editors have been writing in Word and pasting the text into our old Unicode system. I'm now trying to move everything to UTF-8.

However, upon importing the data there are characters I cannot get rid of.

I have tried the functions from the following snippet and Stack Overflow thread, but none of them fix this string: http://snipplr.com/view.php?codeview&id=11171 / How to replace Microsoft-encoded quotes in PHP

String: Danâ??s back for more!!

Upvotes: 0

Views: 637

Answers (1)

Pascal MARTIN

Reputation: 401182

In this kind of situation, I generally start with the string I have copy-pasted from Word:

$str = 'Danâ’s back !';
var_dump($str);


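Then, going through it byte by byte, I output the hexadecimal code of each byte:

// strlen() counts bytes, not characters, so this walks the raw bytes.
for ($i = 0; $i < strlen($str); $i++) {
    $byte = $str[$i];
    $code = ord($byte);                   // numeric value of the byte (0-255)
    printf('%s:0x%02x ', $byte, $code);   // raw byte, then its hex code
}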

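This gives output such as the following (a single byte taken from a multi-byte character is not valid UTF-8 on its own, so it displays as �):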

D:0x44 a:0x61 n:0x6e �:0xc3 �:0xa2 �:0xe2 �:0x80 �:0x99 s:0x73 :0x20 b:0x62 a:0x61 c:0x63 k:0x6b :0x20 !:0x21 


Then, with a bit of guessing, luck, and trial and error, you'll find out that:

  • â is a character encoded on two bytes: 0xc3 0xa2
  • and the special quote is a character encoded on three bytes: 0xe2 0x80 0x99

Hint: it's easier when you don't have two special characters following each other ;-)
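If the string is valid UTF-8, the grouping of bytes into characters does not have to be guessed. The following is a minimal sketch, not part of the original answer: `multibyte_chars` is a hypothetical helper name, and it assumes PHP 7 or later with PCRE's UTF-8 support (the `/u` flag) available.

```php
<?php
// Hypothetical helper (not from the original answer): returns each distinct
// multi-byte UTF-8 character in $str, mapped to its bytes written in hex.
function multibyte_chars(string $str): array
{
    // The /u flag makes PCRE split on UTF-8 character boundaries, not bytes.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return []; // $str was not valid UTF-8
    }

    $found = [];
    foreach ($chars as $char) {
        if (strlen($char) > 1) { // more than one byte, so not plain ASCII
            $found[$char] = implode(' ', str_split(bin2hex($char), 2));
        }
    }
    return $found;
}

// Same bytes as in the dump above.
foreach (multibyte_chars("Dan\xc3\xa2\xe2\x80\x99s back !") as $char => $hex) {
    echo $char, ' => ', $hex, "\n";
}
```

For the sample string this reports two entries, one for the two-byte â and one for the three-byte quote, even though the two characters are adjacent.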


After that, it's only a matter of using str_replace to replace the correct sequence of bytes with another character; for example, to replace the special quote with a normal one:

var_dump(str_replace("\xe2\x80\x99", "'", $str));

This will give you:

string 'Danâ's back !' (length=14)
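The same idea extends to the other punctuation Word tends to insert. Here is a rough sketch, assuming the input is UTF-8; `replace_word_punctuation` is a hypothetical name, and the map is only a starting point covering the UTF-8 byte sequences of a few common "smart" characters.

```php
<?php
// Hypothetical helper (not from the original answer). Assumes $text is UTF-8.
function replace_word_punctuation(string $text): string
{
    $map = [
        "\xe2\x80\x98" => "'",   // U+2018 left single quote
        "\xe2\x80\x99" => "'",   // U+2019 right single quote
        "\xe2\x80\x9c" => '"',   // U+201C left double quote
        "\xe2\x80\x9d" => '"',   // U+201D right double quote
        "\xe2\x80\x93" => '-',   // U+2013 en dash
        "\xe2\x80\x94" => '-',   // U+2014 em dash
        "\xe2\x80\xa6" => '...', // U+2026 horizontal ellipsis
    ];

    // strtr() with an array applies all replacements in a single pass.
    return strtr($text, $map);
}

var_dump(replace_word_punctuation("Dan\xc3\xa2\xe2\x80\x99s back !"));
```

Like the str_replace call above, this leaves the stray â (0xc3 0xa2) in place. Whether it is safe to strip that as well depends on whether a legitimate â can appear in the data.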

Upvotes: 3
