Reputation: 268424
When I come across a broken RSS feed, the usual reason it's all blown to pieces is that line 23 says "Sanford & Sons."
The most confusing thing is the fact that if you convert the & into &amp;, all is well, even though your alternative still contains the problem character.
Why does RSS fail at rendering the ampersand (&) character by default?
Upvotes: 13
Views: 10037
Reputation: 20931
In PHP, you can solve this problem with html_entity_decode() (Source: PHP.net), like so...
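$xml_line =
    '<description>' .
    str_replace(
        // Escape & first so the & inside &lt; and &gt; is not escaped again.
        ['&', '<', '>'],
        ['&amp;', '&lt;', '&gt;'],
        html_entity_decode($description)
    ) .
    '</description>';
Don't forget that you'll need to swap &, < and > back to their entity equivalents (&amp;, &lt; and &gt;) so that they don't break the XML.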
If you find the equivalent of html_entity_decode() for whatever language you are using, you'll be on your way.
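For anyone not on PHP, here is a rough Python sketch of the same decode-then-escape idea. The helper name `to_xml_text` is made up for illustration; `html.unescape` and `xml.sax.saxutils.escape` are standard-library functions, and I'm assuming the description arrives as a plain string.

```python
import html
from xml.sax.saxutils import escape


def to_xml_text(description: str) -> str:
    """Decode HTML entities, then escape the characters XML reserves."""
    # Turn things like "&copy;" or "&amp;" into the characters they stand for.
    decoded = html.unescape(description)
    # Re-escape only what XML requires in text: "&", "<" and ">".
    return escape(decoded)


xml_line = "<description>" + to_xml_text("Sanford & Sons <b>") + "</description>"
print(xml_line)
```

Decoding first and escaping second means input that was already escaped (`&amp;`) and input that was not (`&`) both come out as `&amp;` instead of being double-escaped.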
Upvotes: 0
Reputation: 300728
When a 'raw' & is seen, the parser looks for one of the valid escape sequences that begin with & (such as '&amp;'). When it finds an invalid sequence, it throws an error. That's all there is to it.
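You can see this behaviour with any conforming XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree` (the `parses` helper is just a name I picked for the demo):

```python
import xml.etree.ElementTree as ET


def parses(xml_text: str) -> bool:
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw_ok = parses("<title>Sanford & Sons</title>")          # raw ampersand
escaped_ok = parses("<title>Sanford &amp; Sons</title>")  # escaped ampersand
print(raw_ok, escaped_ok)

# Once parsed, the escaped form reads back as a plain ampersand.
print(ET.fromstring("<title>Sanford &amp; Sons</title>").text)
```

The raw version is rejected outright, while the escaped version parses and yields the text with a single literal ampersand.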
Upvotes: 15
Reputation: 416081
Because RSS is an XML-based format, and in XML the ampersand (&) signifies the start of an entity reference. The parser expects an entity name and a closing semicolon to follow, not a space.
You could argue that it should be smart enough to know that the ampersand in "Sanford & Sons" is just an ampersand. But what about when you really want to show an ampersand followed by text? Is "&pc;" some custom (also invalid) entity, or should it be interpreted as a literal ampersand too? What about "&amp;"?
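A small sketch of how a strict parser settles both questions, using Python's standard `xml.etree.ElementTree` (variable names are mine, and "&pc;" is taken as an entity nobody has declared):

```python
import xml.etree.ElementTree as ET

# "&pc;" has the shape of an entity reference, but no declaration exists for it,
# so the parser refuses the document instead of guessing.
try:
    ET.fromstring("<title>&pc; repair</title>")
    undefined_ok = True
except ET.ParseError:
    undefined_ok = False

# To display the literal text "&amp;", the ampersand itself has to be escaped.
literal = ET.fromstring("<title>&amp;amp;</title>").text

print(undefined_ok, literal)
```

Because the parser never guesses, the only way to get the five characters "&amp;" on screen is to write `&amp;amp;` in the feed.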
Upvotes: 6
Reputation: 451
Not sure if this helps, but when I needed to solve this problem I used the numeric entity reference for an ampersand, which is &#38;. Running this through the W3C validator passed, so I guess it's OK to use this.
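As a quick check, the decimal character reference for the ampersand is `&#38;`, the hexadecimal one is `&#x26;`, and both decode to the same text as the named `&amp;`. A sketch with Python's standard parser (variable names are only for the demo):

```python
import xml.etree.ElementTree as ET

decimal = ET.fromstring("<title>Sanford &#38; Sons</title>").text
hexadecimal = ET.fromstring("<title>Sanford &#x26; Sons</title>").text
named = ET.fromstring("<title>Sanford &amp; Sons</title>").text

print(decimal == hexadecimal == named)
```

All three spellings are interchangeable as far as the parser is concerned, so which one to use is mostly a matter of readability.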
Cheers
Upvotes: 0
Reputation: 82794
The & is a remnant of XML's roots in SGML. There, the &...; syntax is used to escape all kinds of things, even whole documents to embed. So if you want a literal "&", you have to escape it, just as you escape quotes inside strings in any programming language.
There is no point in having XML do error correction along the lines of "if no letter follows, output a literal &", because that would break the SGML syntax that XML is based on.
Most browsers do this for HTML because their makers decided it is better for users to see something than an SGML parse error. But that opens a Pandora's box of which browser applies which error corrections. Look at the HTML5 spec and you'll see what it takes to really define error handling. It's a lot of text.
One special case: You can include a literal "&" in XML/RSS, if you enclose it in a so-called "CDATA" section. That will look like the following:
<item> <![CDATA[ Smith & Wesson ]]> </item>
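To check that a parser really accepts this, here is a short sketch with Python's standard `xml.etree.ElementTree`, fed the exact snippet above (variable names are mine):

```python
import xml.etree.ElementTree as ET

item = ET.fromstring("<item> <![CDATA[ Smith & Wesson ]]> </item>")

# The CDATA wrapper is gone after parsing; only the text remains.
text = item.text.strip()
print(text)

# Writing the element back out, the library escapes the ampersand for us.
item.text = text
serialized = ET.tostring(item, encoding="unicode")
print(serialized)
```

Note that the CDATA section only affects how the text is written in the file; to the consumer, the result is the same as if the ampersand had been escaped.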
Cheers,
Upvotes: 3
Reputation: 8158
Because RSS is XML, and XML demands certain characters be escaped, such as the ampersand.
Upvotes: 2
Reputation: 56439
This depends highly on the RSS client, but most likely it's attempting to XML-decode the contents (in your example, "Sanford & Sons"). When that happens, & indicates the start of an escape sequence. If you don't write it as &amp;, the decoder will try to use the next few characters to complete the escape sequence, and odds are it will fail.
Upvotes: 1
Reputation: 124742
Because it must be escaped in XML syntax. Same reason here.
http://myst-technology.com/public/item/11878
Upvotes: 4