Shadi Almosri

Reputation: 11989

Parsing XML using PHP - Which includes ampersands and other characters

I'm trying to parse an XML file and one of the fields looks like the following:

<link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>

This seems to break the parser. I think it might have something to do with the & in the link?

My code is quite simple:

<?php

$xml = simplexml_load_file("files/this.xml");

echo $xml->getName() . "<br />";

foreach($xml->children() as $child) {
  echo $child->getName() . ": " . $child . "<br />";
}
?>

Any ideas how I can resolve this?

Upvotes: 3

Views: 4995

Answers (6)

Klesun

Reputation: 13673

If your XML already contains some escaped entities, this approach preserves them and escapes only the bare ampersands:

$brokenXmlText = file_get_contents("files/this.xml");
$fixed = preg_replace('/&(?!lt;|gt;|quot;|apos;|amp;|#)/', '&amp;', $brokenXmlText);
$xml = simplexml_load_string($fixed);

Upvotes: 1

gnuwings

Reputation: 950

I think this comment on the PHP manual page about dealing with XML errors in SimpleXML will help you: http://www.php.net/manual/en/simplexml.examples-errors.php#96218
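
To see what the parser actually objects to, here is a minimal sketch using libxml's internal error handling, which is what that manual page covers. The sample string is a shortened, hypothetical version of the link from the question, not the real feed.

```php
<?php
// Collect libxml errors instead of having PHP emit warnings.
libxml_use_internal_errors(true);

// Assumed sample input: a raw ampersand inside <link>, as in the question.
$badXml = '<xml><link>http://foo.com/click.php?var_a=a&var_b=b</link></xml>';

$xml = simplexml_load_string($badXml);
$messages = [];
if ($xml === false) {
    foreach (libxml_get_errors() as $error) {
        // Each LibXMLError object carries a line, a column and a message.
        $messages[] = sprintf(
            'line %d, column %d: %s',
            $error->line,
            $error->column,
            trim($error->message)
        );
    }
    libxml_clear_errors();
}

echo implode("\n", $messages), "\n";
```

The reported position should point at the unescaped ampersand, which confirms where the document stops being well-formed.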

Upvotes: 0

Shadi Almosri

Reputation: 11989

The comment by mjv resolved it:

Alternatively to using &amp;, you may consider putting the URLs and other XML-unfriendly content in <![CDATA[ ... ]]>, i.e. a Character Data block
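
A minimal sketch of what that looks like, assuming you control how the feed is generated. The XML string below is a shortened, hypothetical version of the link from the question.

```php
<?php
// Assumed sample: the link wrapped in a CDATA section, so the raw & is allowed.
$xmlText = '<xml><link><![CDATA[http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b]]></link></xml>';

$xml = simplexml_load_string($xmlText);

// Casting the element to string returns the CDATA text with ampersands intact.
$link = (string) $xml->link;
echo $link, "\n";
```

This only helps if the feed itself can be changed; a third-party feed with bare ampersands still has to be repaired before parsing.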

Upvotes: 0

Pascal MARTIN

Reputation: 400912

Your XML feed is not valid XML: the & should be escaped as &amp;

This means you cannot use an XML parser on it as-is :-(

A possible "solution" (feels wrong, but should work) would be to replace every '&' that is not part of an entity with '&amp;', to get a valid XML string before loading it with an XML parser.


In your case, considering this:

$str = <<<STR
<xml>
  <link>http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b&varc=http%3A%2F%2Fwww.foo.com%2Fthis-section-here%2Fperf%2F229408%3Fvalue%3D0222%26some_variable%3Dmeee</link>
</xml>
STR;

You might use a simple call to str_replace, like this:

$str = str_replace('&', '&amp;', $str);

And then parse the string (now valid XML) that's in $str:

$xml = simplexml_load_string($str);
var_dump($xml);

In this case, it should work...


But note that you must take care with entities: if you already have an entity like '&gt;', you must not turn it into '&amp;gt;'!

Which means that such a simple call to str_replace is not the right solution: it will probably break many XML feeds!

Up to you to find the right way to do that replacement -- maybe with some kind of regex...
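
As a rough sketch of that regex idea (not a complete fix for every broken feed): escape an ampersand only when it is not already the start of one of the five predefined XML entities or a numeric character reference. The function name escapeBareAmpersands and the sample string are made up for illustration.

```php
<?php
// Hypothetical helper: escape only ampersands that do not already
// start a predefined entity or a numeric character reference.
function escapeBareAmpersands(string $xmlText): string
{
    // Negative lookahead skips &amp; &lt; &gt; &quot; &apos; &#62; and &#x3E; forms.
    return preg_replace(
        '/&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xmlText
    );
}

// Assumed sample: one bare ampersand in <link>, existing entities in <note>.
$str = '<xml><link>http://foo.com/click.php?var_a=a&var_b=b</link>'
     . '<note>1 &gt; 0 &amp; 2 &#62; 1</note></xml>';

$fixed = escapeBareAmpersands($str);
$xml = simplexml_load_string($fixed);

echo (string) $xml->link, "\n";
echo (string) $xml->note, "\n";
```

Other named entities such as &nbsp; are not defined in plain XML, so this sketch escapes them too and they end up as literal text.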

Upvotes: 4

Malax

Reputation: 9604

The XML snippet you posted is not valid. Ampersands have to be escaped; this is why the parser complains.

Upvotes: 4

Greg

Reputation: 321578

It breaks the parser because your XML is invalid - & should be encoded as &amp;.
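
If you are the one generating the XML, a small sketch of doing that encoding with htmlspecialchars and the ENT_XML1 flag might look like this. The URL is a shortened, hypothetical version of the one in the question.

```php
<?php
// Assumed scenario: you build the <link> element yourself from a raw URL.
$url = 'http://foo.com/this-platform/scripts/click.php?var_a=a&var_b=b';

// ENT_XML1 escapes &, < and > for XML; ENT_QUOTES also covers both quote types.
$escaped = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
$xmlText = "<xml><link>{$escaped}</link></xml>";

$xml = simplexml_load_string($xmlText);

// The parser decodes &amp; back to &, so the original URL comes out unchanged.
$roundTrip = (string) $xml->link;
echo $escaped, "\n";
echo $roundTrip, "\n";
```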

Upvotes: 2
