Paul

Reputation: 2759

PHP - Processing Invalid XML

I'm using SimpleXML to load in some xml files (which I didn't write/provide and can't really change the format of).

Occasionally (e.g. one or two files out of every 50 or so) they don't escape special characters (mostly &, but sometimes other random invalid things too). This creates an issue because SimpleXML in PHP just fails, and I don't really know of any good way to handle parsing invalid XML.

My first idea was to preprocess the XML as a string and wrap ALL fields in CDATA so it would work, but for some ungodly reason the XML I need to process puts all of its data in attributes. Thus I can't use the CDATA idea. An example of the XML:

 <Author v="By Someone & Someone" />

What's the best way to replace all the invalid characters in the XML before I load it with SimpleXML?

Upvotes: 5

Views: 5722

Answers (3)

Mark Bradley

Reputation: 510

Despite this problem being 10 years old (as I'm typing this), I'm still experiencing similar XML parsing issues (PHP 8.1), which is why I ended up here. The answers already given are helpful, but either incomplete, inconsistent or otherwise unsuitable for my problem, and I suspect for the original poster's too.

Inspecting internal XML parsing issues seems right, but there are 735 error codes (see https://gnome.pages.gitlab.gnome.org/libxml2/devhelp/libxml2-xmlerror.html), so a more adaptable solution seems appropriate.

I used the word "inconsistent" above because the best of the other answers (@Adam Szmyd) mixed multibyte string handling with non-multibyte string handling.

The following code uses Adam's as the base, reworked for my situation, and I feel it could be extended further depending on the problems actually being experienced. So my answer isn't complete either, sorry!

The essence of this code is that it handles "each" (in my implementation, just 1) XML parsing error as a separate case. The error I was experiencing was an unrecognised HTML entity (&ccedil; - ç), so I use PHP entity replacement instead.

function load_invalid_xml($xml)
{
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments; passing one throws an ArgumentCountError on PHP 8

    $sxe = simplexml_load_string($xml);

    if ($sxe)
        return $sxe;

    $fixed_xml = '';
    $last_pos  = 0;

    // make string flat
    $xmlFlat = mb_ereg_replace( '(\r\n|\r|\n)', '', $xml );

    // Regenerate the error but using the flattened source so error offsets are directly relevant
    libxml_clear_errors();
    $xml_doc = @simplexml_load_string( $xmlFlat );

    foreach (libxml_get_errors() as $error)
    {
        $pos = $error->column - 1; // ->column appears to be 1 based, not 0 based

        switch( $error->code ) {

            case 26: // error undeclared entity
            case 27: // warning undeclared entity
                if ($pos >= 0) { // the PHP docs suggest this not always set (in which case ->column is == 0)

                    $left = mb_substr( $xmlFlat, 0, $pos );
                    $amp = mb_strrpos( $left, '&' );

                    if ($amp !== false) {

                        $entity = mb_substr( $left, $amp );
                        $fixed_xml .= mb_substr( $xmlFlat, $last_pos, $amp - $last_pos )
                            . html_entity_decode( $entity );
                        $last_pos = $pos;
                    }
                }
                break;

            default:
        }
    }
    $fixed_xml .= mb_substr($xmlFlat, $last_pos); // offsets refer to the flattened string, not $xml

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}
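
For the specific case of undeclared named entities, a simpler pre-pass may be enough. The following is a minimal sketch and not part of the function above: `decode_named_entities_sketch` is a hypothetical name, and it assumes the input is UTF-8 and that every `&name;` sequence in it is meant as an entity reference. Each match is decoded with html_entity_decode() and then re-escaped with htmlspecialchars(), so the result stays valid inside an attribute.

```php
<?php
// Hypothetical helper (the name is illustrative): expand HTML named entities
// that XML does not know about, before handing the string to SimpleXML.
function decode_named_entities_sketch(string $xml): string
{
    return preg_replace_callback(
        '/&([A-Za-z][A-Za-z0-9]*);/',
        function (array $m): string {
            // &ccedil; becomes the literal character; &amp;, &lt; and friends
            // decode to their character and are escaped again right away.
            $decoded = html_entity_decode($m[0], ENT_QUOTES | ENT_HTML5, 'UTF-8');

            // Names html_entity_decode() does not recognise come back unchanged
            // and end up as literal text here (e.g. &amp;foo;).
            return htmlspecialchars($decoded, ENT_QUOTES | ENT_XML1, 'UTF-8');
        },
        $xml
    );
}

$fixed = decode_named_entities_sketch('<Author v="Fran&ccedil;ois &amp; Someone" />');
$sxe   = simplexml_load_string($fixed);
```

This does not help with a bare `&` as in the question's example, which is not followed by a name and a semicolon and therefore never matches the pattern.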

Upvotes: 0

Adam Szmyd

Reputation: 2973

I think a workaround for having to write the compute_position function is to flatten the XML string before processing. Here is a rewrite of the code posted by Josh:

function load_invalid_xml($xml)
{
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments; passing one throws an ArgumentCountError on PHP 8

    $sxe = simplexml_load_string($xml);

    if ($sxe)
    {
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos  = 0;

    // make string flat
    $xml = str_replace(array("\r\n", "\r", "\n"), "", $xml);

    // get file encoding
    $encoding = mb_detect_encoding($xml);

    foreach (libxml_get_errors() as $error)
    {
        $pos = $error->column;
        $invalid_char = mb_substr($xml, $pos, 1, $encoding);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($invalid_char);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}

I've added the encoding handling because I had problems with the plain array[index] way of getting a character from the string.

This should all work but, I don't know why, $error->column gives me a different number than it should. I tried to debug this by adding some invalid characters to the XML and checking what value it returned, but had no luck. I hope someone can tell me what is wrong with this approach.
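
One way to investigate the mismatch is to print exactly what libxml reports for a known-bad input and compare it with the string by hand. Below is a minimal sketch that uses the question's sample line as input. `collect_libxml_errors_sketch` is a hypothetical helper name, and no particular column value is claimed here, since it may depend on the libxml version.

```php
<?php
// Hypothetical helper: parse a string and return what libxml reported,
// so line/column values can be compared against the input by hand.
function collect_libxml_errors_sketch(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    simplexml_load_string($xml);

    $report = [];
    foreach (libxml_get_errors() as $error) {
        $report[] = [
            'line'    => $error->line,
            'column'  => $error->column,
            'code'    => $error->code,
            'message' => trim($error->message),
        ];
    }

    // Leave libxml's error state the way we found it.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $report;
}

$sample = '<Author v="By Someone & Someone" />';
foreach (collect_libxml_errors_sketch($sample) as $e) {
    // Print the report next to the input text up to the reported column
    // (sliced by bytes, which is itself an assumption worth checking).
    printf("%d:%d [%d] %s | %s\n", $e['line'], $e['column'], $e['code'],
        $e['message'], substr($sample, 0, $e['column']));
}
```

Note that the errors should come from parsing the same string that is later sliced. If the string is flattened after parsing, the reported line and column still refer to the original layout.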

Upvotes: 2

Josh Davis

Reputation: 28730

What you need is something that will use libxml's internal errors to locate invalid characters and escape them accordingly. Here's a mockup of how I'd write it. Take a look at the result of libxml_get_errors() for error info.

function load_invalid_xml($xml)
{
    $use_internal_errors = libxml_use_internal_errors(true);
    libxml_clear_errors(); // takes no arguments; passing one throws an ArgumentCountError on PHP 8

    $sxe = simplexml_load_string($xml);

    if ($sxe)
    {
        return $sxe;
    }

    $fixed_xml = '';
    $last_pos  = 0;

    foreach (libxml_get_errors() as $error)
    {
        // $pos is the position of the faulty character,
        // you have to compute it yourself
        $pos = compute_position($error->line, $error->column);
        $fixed_xml .= substr($xml, $last_pos, $pos - $last_pos) . htmlspecialchars($xml[$pos]);
        $last_pos = $pos + 1;
    }
    $fixed_xml .= substr($xml, $last_pos);

    libxml_use_internal_errors($use_internal_errors);

    return simplexml_load_string($fixed_xml);
}
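
compute_position() is left undefined above. Here is a minimal sketch of one way to write it, under the assumptions that libxml's line and column are 1-based, that columns count bytes, and that lines are separated by "\n". The name `compute_position_sketch` and the extra `$xml` parameter are hypothetical. The reported column may point just past the offending character rather than at it, so the result is worth checking against real input.

```php
<?php
// Hypothetical helper: map a 1-based (line, column) pair, as reported in a
// LibXMLError, to a 0-based byte offset into $xml. Returns null when the
// location is missing or falls outside the string.
function compute_position_sketch(string $xml, int $line, int $column): ?int
{
    if ($line < 1 || $column < 1) {
        return null; // libxml did not report a usable location
    }

    // Walk forward to the first byte of the requested line.
    $lineStart = 0;
    for ($i = 1; $i < $line; $i++) {
        $newline = strpos($xml, "\n", $lineStart);
        if ($newline === false) {
            return null; // the string has fewer lines than reported
        }
        $lineStart = $newline + 1;
    }

    $pos = $lineStart + $column - 1;

    return $pos < strlen($xml) ? $pos : null;
}

$xml = "<a>\n<b v=\"x & y\"/>\n</a>";
$pos = compute_position_sketch($xml, 2, 9); // line 2, column 9 is the bare '&'
```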

Upvotes: 7
