Nikolan

Reputation: 21

UTF-8 encoding issue

I am trying to fetch data from an RSS feed (located at http://www.bgsvetionik.com/rss/) in a C# WinForms application. Take a look at the following code:

// Requires: using System.IO; using System.Net; using System.Xml;
public static XmlDocument FromUri(string uri)
{
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.XmlResolver = null;

    // Dispose the client and the response stream once the document is loaded.
    using (WebClient webClient = new WebClient())
    using (Stream rssStream = webClient.OpenRead(uri))
    {
        XmlTextReader reader = new XmlTextReader(rssStream);
        xmlDoc.Load(reader);
    }
    return xmlDoc;
}

Although xmlDoc.InnerXml contains the XML declaration with UTF-8 encoding, I get &scaron; instead of š, and so on.

How can I solve it?

Upvotes: 1

Views: 1332

Answers (1)

Jon Skeet

Reputation: 1504122

The feed's data is incorrect. The &scaron; is inside a CDATA section, so it isn't being treated as an entity by the XML parser.
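To see the parser behaviour in isolation, here is a minimal sketch. The class and method names (CDataDemo, ReadRootText) are made up for illustration and are not part of the question's code. It loads a small XML string and returns the root element's text as the parser reports it.

```csharp
using System;
using System.Xml;

public static class CDataDemo
{
    // Hypothetical helper: returns the root element's text exactly as
    // the XML parser exposes it, with no post-processing.
    public static string ReadRootText(string xml)
    {
        var doc = new XmlDocument();
        doc.XmlResolver = null;
        doc.LoadXml(xml);
        return doc.DocumentElement.InnerText;
    }
}
```

With this, ReadRootText("<title><![CDATA[&scaron;]]></title>") gives back the literal text &scaron;, because CDATA content is passed through as-is, while ReadRootText("<title>&#353;</title>") gives š, because a numeric character reference outside CDATA is resolved by the parser.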

If you look at the source XML, you'll find that there's a mixture of entities and "raw" characters, e.g. in the word čišćenja in the middle of the first title.

If you need to correct that, you'll have to do it yourself with a Replace call - the XML parser is doing exactly what it's meant to.

EDIT: For the replacement, you could get hold of all the HTML entities and replace them one by one, or just find out which ones are actually being used. Then do:

string text = element.Value.Replace("&scaron;", "š")
                           .Replace(...);

Of course, this means that anything which is actually correctly escaped and should really be that text will get accidentally replaced... but such is the problem with broken data :(
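As an alternative to chaining Replace calls, you could try decoding every entity in one go. The sketch below is only that, a sketch: FeedText and DecodeCData are hypothetical names, and it assumes System.Net.WebUtility.HtmlDecode (available from .NET Framework 4) recognises the entities the feed actually uses, which you would need to check against the real data. It walks the document and decodes only the CDATA sections, leaving ordinary text nodes alone.

```csharp
using System;
using System.Net;
using System.Xml;

public static class FeedText
{
    // Hypothetical helper: recursively visits the node tree and
    // HTML-decodes the content of each CDATA section in place.
    public static void DecodeCData(XmlNode node)
    {
        if (node is XmlCDataSection)
        {
            node.Value = WebUtility.HtmlDecode(node.Value);
            return;
        }

        foreach (XmlNode child in node.ChildNodes)
        {
            DecodeCData(child);
        }
    }
}
```

You would call FeedText.DecodeCData(xmlDoc) right after loading the feed. The same caveat applies as with Replace: any text that was deliberately escaped inside a CDATA section gets decoded too.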

Upvotes: 3
