Kohjah Breese

Reputation: 4136

Replace Invalid UTF-8, Not Remove

Evening,

I have HTML files that I am cleaning. These have some invalid Unicode characters that appear in my text editor like:

/B7

I want to replace these with either the character they should be or a replacement character of my choice. For example, the /B7 character is a middot, but I want to replace it with a full stop.

The function here: PHP - Fast way to strip all characters not displayable in browser from utf8 string

removes the invalid characters, but I am not keyed up enough on encoding to do anything more with it.

Upvotes: 0

Views: 673

Answers (1)

Esailija

Reputation: 140220

Your file is very likely encoded in Windows-1252 (where 0xB7 decodes to ·), and gEdit is decoding it as UTF-8. The byte 0xB7 is invalid in UTF-8 unless it appears inside a specific multi-byte sequence, so I guess gEdit shows the invalid bytes directly as their values. You can fix the file in many ways, but in PHP you could:

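<?php
// Read the raw bytes of the mis-encoded file.
$file_contents = file_get_contents("brokenfile.txt");
// Decode the bytes as Windows-1252 and re-encode them as UTF-8 (requires the mbstring extension).
$file_contents = mb_convert_encoding($file_contents, "UTF-8", "Windows-1252");
// Overwrite the original file with the UTF-8 version.
file_put_contents("brokenfile.txt", $file_contents);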

The above script will decode the file as Windows-1252 and encode it as UTF-8.
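If the goal is also to swap the middot for a full stop, as the question asks, the conversion can be followed by a plain string replacement. The following is only a minimal sketch: the function name `convert_and_replace_middot` is made up for illustration, it assumes the mbstring extension is loaded and PHP 7 or later (for the `"\u{...}"` escape), and it assumes that any input which is not valid UTF-8 really is Windows-1252.

```php
<?php
// Hypothetical helper (not from the original answer): normalise to UTF-8,
// then replace the middle dot with a full stop.
function convert_and_replace_middot(string $bytes): string
{
    // Only convert when the input is not already valid UTF-8,
    // so running the helper twice does not double-encode the text.
    if (!mb_check_encoding($bytes, "UTF-8")) {
        // Assumption: invalid UTF-8 input is Windows-1252.
        $bytes = mb_convert_encoding($bytes, "UTF-8", "Windows-1252");
    }

    // "\u{00B7}" is U+00B7 MIDDLE DOT, i.e. the bytes 0xC2 0xB7 in UTF-8.
    return str_replace("\u{00B7}", ".", $bytes);
}

// A lone 0xB7 byte, as it would appear in a Windows-1252 file.
echo convert_and_replace_middot("one\xB7two"), "\n"; // prints "one.two"
```

Other characters could be mapped the same way by passing arrays of search and replacement strings to `str_replace`.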

Text editors usually let you choose the encoding a file is saved in, either in the 'Save As' dialog or in a configuration setting. You should always configure your editor's encodings before using it.

If you see Â· on your website after this conversion, it means you are telling browsers that your content is in Windows-1252, ISO-8859-1 or similar. You must tell browsers that your content is in UTF-8:

header("Content-Type: text/html; charset=utf-8");
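To see why a wrong charset declaration produces a stray "Â" in front of the middot, the mismatch can be simulated in PHP. This is only an illustrative sketch, assuming the mbstring extension and PHP 7 or later; the variable names are made up. It treats the two UTF-8 bytes of the middot as if they were two Windows-1252 characters, which is roughly what a browser does when given the wrong charset.

```php
<?php
// U+00B7 MIDDLE DOT encoded as UTF-8 is the two bytes 0xC2 0xB7.
$middot_utf8 = "\u{00B7}";
echo bin2hex($middot_utf8), "\n"; // prints "c2b7"

// Simulate a reader that wrongly decodes those bytes as Windows-1252:
// 0xC2 becomes "Â" (U+00C2) and 0xB7 becomes "·" (U+00B7).
$misread = mb_convert_encoding($middot_utf8, "UTF-8", "Windows-1252");
echo $misread, "\n"; // prints "Â·"
```

Seeing "Â·" in a page is therefore a sign that the file itself is already UTF-8 and only the declared charset is wrong.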

Upvotes: 3
