notnoop

Reputation: 59307

Wrong Mixed Character Encoding in XML

I have an automatically generated XML file that is supposed to be encoded in UTF-8. For the most part, the encoding is correct. However, a few characters are not encoded properly. When viewing the file in Emacs, I see \370 and \351 (octal escapes for the raw bytes 0xF8 and 0xE9).

Is there a way to detect these characters programmatically? I prefer solutions using PHP, but solutions in Perl or Java would be very helpful as well.

Upvotes: 2

Views: 1476

Answers (3)

VolkerK

Reputation: 96159

You can use libxml_use_internal_errors and libxml_get_errors to loop through the errors that occurred when the document was loaded. The error code you're looking for is XML_ERR_INVALID_CHAR = 9.

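<?php
// Sample document declared as UTF-8 but containing two bytes that are not valid UTF-8.
$xml = '<?xml version="1.0" encoding="utf-8"?>
<a>
    <b>' . chr(0xfd) . chr(0xff) . '</b>
</a>';
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument;
$doc->loadXML($xml);

// Each LibXMLError carries the code, line and column of the problem.
foreach (libxml_get_errors() as $error) {
    print_r($error);
}
libxml_clear_errors();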

prints

LibXMLError Object
(
    [level] => 3
    [code] => 9
    [column] => 5
    [message] => Input is not proper UTF-8, indicate encoding !
Bytes: 0xFD 0xFF 0x3C 0x2F

    [file] => 
    [line] => 3
)

Upvotes: 0

Martin v. Löwis

Reputation: 127537

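You can check whether a string is structurally valid UTF-8 with this regular expression. Continuation bytes are restricted to 0x80-0xBF and lead bytes to 0xC2-0xF4; the pattern does not reject overlong three- and four-byte forms or encoded surrogates:

(^(?:
[\x00-\x7f] |
[\xc2-\xdf][\x80-\xbf] |
[\xe0-\xef][\x80-\xbf]{2} |
[\xf0-\xf4][\x80-\xbf]{3}
)*$)x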

Upvotes: 3

Jon Skeet

Reputation: 1503280

Are you absolutely certain that the encoding is incorrect? Rather than using Emacs, I'd use a binary file viewer. What are the actual bytes at the problematic position?

With Java it would be reasonably easy to detect invalid UTF-8 byte patterns. I'm not sure whether the default Charset support would handle it, but UTF-8 is pretty simple. I usually use the UTF-8 table here as a reference for valid byte sequences.
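As a rough sketch of the Java route: the standard CharsetDecoder can be configured to report malformed input instead of silently replacing it, which gives the byte offset of the first bad sequence. The class name Utf8Check and the method firstInvalidOffset are made up for this illustration; only the java.nio calls are standard.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class, named for this example only.
public class Utf8Check {

    // Returns the byte offset of the first invalid UTF-8 sequence,
    // or -1 if the whole input decodes cleanly.
    static int firstInvalidOffset(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer in = ByteBuffer.wrap(data);
        CharBuffer out = CharBuffer.allocate(1024);
        while (true) {
            // endOfInput = true so a truncated trailing sequence is reported too.
            CoderResult result = decoder.decode(in, out, true);
            if (result.isError()) {
                // On error the input position is left at the offending bytes.
                return in.position();
            }
            if (result.isUnderflow()) {
                return -1; // all input consumed without errors
            }
            out.clear(); // overflow: discard decoded chars and keep going
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java Utf8Check <file>");
            return;
        }
        int offset = firstInvalidOffset(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(offset < 0
                ? "valid UTF-8"
                : "first invalid byte at offset " + offset);
    }
}
```

This only reports the first problem. To list every bad byte you would have to skip past the reported sequence and continue decoding, which is left out here to keep the sketch short.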

Upvotes: 1
