Reputation: 4136
Evening,
I have HTML files that I am cleaning. These have some invalid Unicode characters that appear in my text editor like:
/B7
I want to replace these with either the character they should be, or a replacement character of my choice. For example, the /B7 character is a middot, but I want to replace it with a full-stop.
The function from "PHP - Fast way to strip all characters not displayable in browser from utf8 string" removes the invalid characters, but I don't know enough about encodings to do anything more with it.
Upvotes: 0
Views: 673
Reputation: 140220
Your file is very likely encoded in Windows-1252 (where 0xB7 decodes to ·), and gEdit is decoding it as UTF-8 and showing the invalid bytes directly as their hex value, I guess (0xB7 is only valid in UTF-8 as a continuation byte inside a multi-byte sequence). You can fix the file in many ways, but in PHP you could:
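<?php
// Read the raw bytes (assumed to be Windows-1252).
$file_contents = file_get_contents("brokenfile.txt");
// Re-encode the text from Windows-1252 to UTF-8.
$file_contents = mb_convert_encoding($file_contents, "UTF-8", "Windows-1252");
// Overwrite the file. Run this only once, or the text gets double-encoded.
file_put_contents("brokenfile.txt", $file_contents);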
The above script will decode the file as Windows-1252 and encode it as UTF-8.
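If you also want the middot turned into a full stop, as the question asks, something like the following might work. It is only a sketch: `clean_middots` is a made-up helper name, and the "already valid UTF-8" check is a heuristic, since some Windows-1252 byte sequences also happen to be valid UTF-8. It assumes PHP 7+ with the mbstring extension.

```php
<?php
// Hypothetical helper (not from the linked question): convert Windows-1252
// bytes to UTF-8, then replace every middot (U+00B7) with a full stop.
function clean_middots(string $bytes): string
{
    // Heuristic guard: skip the conversion if the input already validates as
    // UTF-8, so running the script twice does not double-encode the text.
    if (!mb_check_encoding($bytes, "UTF-8")) {
        $bytes = mb_convert_encoding($bytes, "UTF-8", "Windows-1252");
    }

    // str_replace works on bytes, which is safe here because it matches the
    // exact UTF-8 byte sequence of the middot (0xC2 0xB7).
    return str_replace("\u{00B7}", ".", $bytes);
}
```

You would then pass it the result of file_get_contents("brokenfile.txt") and write the return value back with file_put_contents, as in the script above.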
Text editors usually let you choose the encoding to save files in, either in the 'Save As' dialog or in a configuration setting. You should always configure your editor's encoding before using it.
If you see Â· instead of · on your website after this conversion, that means you are telling the browsers that your stuff is in Windows-1252 or ISO-8859-1 etc. You must tell the browsers that your stuff is in UTF-8:
header("Content-Type: text/html; charset=utf-8");
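To see why a wrong charset declaration produces a stray Â, here is a small illustration (assuming PHP 7+ with mbstring). It takes the UTF-8 bytes of a middot and deliberately misreads them as Windows-1252, which is roughly what a browser does when it is told the wrong charset:

```php
<?php
// The middot encoded as UTF-8 is two bytes: 0xC2 0xB7.
$middot = "\u{00B7}";
$hex = bin2hex($middot); // "c2b7"

// Misinterpret those two bytes as Windows-1252 characters and convert the
// result to UTF-8 for display: 0xC2 becomes "Â" and 0xB7 becomes "·".
$misread = mb_convert_encoding($middot, "UTF-8", "Windows-1252");

echo $hex, "\n";
echo $misread, "\n";
```

Once the header declares UTF-8, the browser reads 0xC2 0xB7 as a single character again.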
Upvotes: 3