Reputation: 59307
I have an automatically generated XML file that is supposed to be encoded in UTF-8. For the most part the encoding is correct, but a few characters are not encoded properly; when viewing the file in Emacs, I see \370 and \351.
Is there a way to detect these characters programmatically? I'd prefer a solution in PHP, but solutions in Perl or Java would be very helpful as well.
Upvotes: 2
Views: 1476
Reputation: 96159
You can use libxml_use_internal_errors and libxml_get_errors to loop through the errors that occurred when the document was loaded. The error code you're looking for is XML_ERR_INVALID_CHAR = 9.
<?php
$xml = '<?xml version="1.0" encoding="utf-8"?>
<a>
<b>' . chr(0xfd) . chr(0xff) . '</b>
</a>';
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadXML($xml);

foreach (libxml_get_errors() as $error) {
    print_r($error);
}
libxml_clear_errors();
prints
LibXMLError Object
(
    [level] => 3
    [code] => 9
    [column] => 5
    [message] => Input is not proper UTF-8, indicate encoding !
Bytes: 0xFD 0xFF 0x3C 0x2F
    [file] =>
    [line] => 3
)
Upvotes: 0
Reputation: 127537
You can check whether a string is well-formed UTF-8 with this regular expression (note that UTF-8 continuation bytes are always in the range \x80–\xbf):
(^(?:
  [\x00-\x7f] |
  [\xc0-\xdf][\x80-\xbf] |
  [\xe0-\xef][\x80-\xbf]{2} |
  [\xf0-\xf7][\x80-\xbf]{3}
)*$)x
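A sketch of the same check in Java (class and method names are my own): Java regexes operate on chars, so decoding the raw bytes as ISO-8859-1 first maps each byte to the char with the same numeric value, letting the byte-oriented pattern apply directly:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

public class Utf8Check {
    // Zero or more well-formed UTF-8 sequences: ASCII, or a lead byte
    // followed by the right number of continuation bytes (0x80-0xBF).
    private static final Pattern UTF8 = Pattern.compile(
            "(?:[\\x00-\\x7f]"
            + "|[\\xc0-\\xdf][\\x80-\\xbf]"
            + "|[\\xe0-\\xef][\\x80-\\xbf]{2}"
            + "|[\\xf0-\\xf7][\\x80-\\xbf]{3})*");

    public static boolean looksLikeUtf8(byte[] bytes) {
        // ISO-8859-1 decodes each byte to the char with the same value,
        // so the pattern above sees the raw bytes unchanged.
        String asLatin1 = new String(bytes, StandardCharsets.ISO_8859_1);
        return UTF8.matcher(asLatin1).matches();
    }
}
```

This is a structural check only; it accepts some sequences (overlongs, surrogates) that a strict decoder would reject.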
Upvotes: 3
Reputation: 1503280
Are you absolutely certain that the encoding is incorrect? Rather than using Emacs, I'd look at the file in a binary (hex) viewer. What are the actual bytes at the problematic positions?
With Java it would be reasonably easy to detect invalid UTF-8 byte patterns. I'm not sure whether the default Charset support would handle it, but UTF-8 is pretty simple. I usually use the UTF-8 table here as a reference for valid byte sequences.
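In fact, the standard java.nio.charset API can report malformed input directly. A sketch (class and method names are mine) that returns the byte offset of the first invalid UTF-8 sequence:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class FindInvalidUtf8 {
    /**
     * Returns the byte offset of the first malformed UTF-8 sequence,
     * or -1 if the whole array is well-formed UTF-8.
     */
    public static int firstInvalidByte(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // UTF-8 never decodes to more chars than it has bytes
        CharBuffer out = CharBuffer.allocate(bytes.length);
        CoderResult result = decoder.decode(in, out, true);
        if (result.isError()) {
            // On error, the input buffer is positioned at the start
            // of the offending byte sequence.
            return in.position();
        }
        return -1;
    }
}
```

The same approach works for an entire file by reading it into a byte array (or decoding it in chunks) before calling the method.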
Upvotes: 1