Reputation: 21
I am trying to fetch data from an RSS feed (the feed is at http://www.bgsvetionik.com/rss/ ) in a C# WinForms application. Take a look at the following code:
public static XmlDocument FromUri(string uri)
{
    XmlDocument xmlDoc;
    WebClient webClient = new WebClient();
    using (Stream rssStream = webClient.OpenRead(uri))
    {
        XmlTextReader reader = new XmlTextReader(rssStream);
        xmlDoc = new XmlDocument();
        xmlDoc.XmlResolver = null;
        xmlDoc.Load(reader);
    }
    return xmlDoc;
}
Although xmlDoc.InnerXml contains an XML declaration with UTF-8 encoding, I get &scaron; instead of š, etc.
How can I solve it?
Upvotes: 1
Views: 1332
Reputation: 1504122
The feed's data is incorrect. The &scaron; is inside a CDATA section, so it isn't being treated as an entity by the XML parser.
If you look at the source XML, you'll find that there's a mixture of entities and "raw" characters, e.g. či&scaron;ćenja in the middle of the first title.
If you need to correct that, you'll have to do it yourself with a Replace call - the XML parser is doing exactly what it's meant to.
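To make the CDATA point concrete, here is a minimal sketch using a made-up two-element sample rather than the real feed (the class name, method name, and sample XML are all hypothetical). It contrasts entity-like text inside a CDATA section with a numeric character reference outside one:

```csharp
using System;
using System.Xml;

static class CDataDemo
{
    // Hypothetical sample: the first title mimics the broken feed data,
    // the second uses a numeric character reference that XML understands.
    public const string SampleXml =
        "<items>" +
        "<title><![CDATA[&scaron;]]></title>" +
        "<title>&#353;</title>" +
        "</items>";

    // Returns the text of every <title> element exactly as the parser reports it.
    public static string[] ReadTitles(string xml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);

        XmlNodeList titles = doc.GetElementsByTagName("title");
        string[] result = new string[titles.Count];
        for (int i = 0; i < titles.Count; i++)
        {
            result[i] = titles[i].InnerText;
        }
        return result;
    }

    static void Main()
    {
        string[] titles = ReadTitles(SampleXml);
        // Inside CDATA the text is passed through untouched.
        Console.WriteLine(titles[0] == "&scaron;");
        // Outside CDATA the parser resolves the character reference to U+0161.
        Console.WriteLine(titles[1] == "\u0161");
    }
}
```

The parser leaves the first title alone because CDATA content is, by definition, not markup.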
EDIT: For the replacement, you could get hold of all the HTML entities and replace them one by one, or just find out which ones are actually being used. Then do:
string text = element.Value.Replace("&scaron;", "š")
                           .Replace(...);
Of course, this means that anything which is actually correctly escaped and should really be that text will get accidentally replaced... but such is the problem with broken data :(
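If replacing every HTML entity is the goal, one possible shortcut is to let the framework do the lookup instead of chaining Replace calls. The sketch below assumes .NET 4.0 or later, where System.Net.WebUtility.HtmlDecode is available; the FeedEntityFixer class, the DecodeElements method, and the sample XML are hypothetical names made up for illustration. The same caveat applies: text that was correctly escaped and meant to stay literal will be decoded as well.

```csharp
using System;
using System.Linq;
using System.Net;
using System.Xml;

static class FeedEntityFixer
{
    // Decodes HTML entities in the text of every element with the given name.
    public static void DecodeElements(XmlDocument doc, string elementName)
    {
        // Copy the matches to a list first so the document is not modified
        // while the node list is being enumerated.
        var nodes = doc.GetElementsByTagName(elementName).Cast<XmlNode>().ToList();
        foreach (XmlNode node in nodes)
        {
            // After this assignment the element holds the decoded plain text.
            node.InnerText = WebUtility.HtmlDecode(node.InnerText);
        }
    }

    static void Main()
    {
        // Hypothetical sample: raw characters mixed with an HTML entity inside CDATA.
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(
            "<rss><channel><item>" +
            "<title><![CDATA[\u010Di&scaron;\u0107enja]]></title>" +
            "</item></channel></rss>");

        DecodeElements(doc, "title");

        string title = doc.GetElementsByTagName("title")[0].InnerText;
        Console.WriteLine(title == "\u010Di\u0161\u0107enja");
    }
}
```

In the question's code this would be called on the XmlDocument returned by FromUri, once for each element name that carries text, such as title or description.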
Upvotes: 3