Andres

Reputation: 2899

.NET WebClient encoding not working

I'm trying to parse an HTML document using the .NET WebClient, but the characters I'm getting are not correct. I have tried lots of encodings, but I can't find out why I'm getting it wrong.

The URL is http://www.vatican.va/archive/ESL0506/__P2.HTM.

This is my code (you can test it in a console app):

    static void Main(string[] args)
    {
        WebClient client = new WebClient();
        client.Encoding = Encoding.GetEncoding(28591);
        var htmlCode = client.DownloadString("http://www.vatican.va/archive/ESL0506/__P2.HTM");

        var splittedHtml = htmlCode.Split('<').ToList();

        var htmlVerses = splittedHtml.Where(x => x.StartsWith("p class=MsoNormal align=left")).ToList();
    }

Then, in htmlVerses I get strings like:

"p class=MsoNormal align=left style='margin-left:0cm;text-align:left;\ntext-indent:0cm'>3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;."

Check this part: 3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;

It's not parsed correctly. It should be: 3 Entonces Dios dijo: «Que exista la luz». Y la luz existió.

If we check the page source in Chrome, we get this:

[Image: screenshot of the page source in Chrome]

Then I tried to get the source code from http://www.generateit.net/seo-tools/source-viewer/ and I'm getting the same anomaly as in my app.

It's really odd: the encoding the web page declares is charset=iso-8859-1, the same one my WebClient uses.

Any help would be appreciated.

Upvotes: 0

Views: 444

Answers (1)

Setsu

Reputation: 1218

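HTML escapes special characters as character entities (for example, &laquo; for « and &oacute; for ó), so this is not an encoding problem: the download is correct and you need to convert the entities back. Fortunately, .NET provides methods to automagically do that for you: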

HttpUtility.HtmlDecode()

see: MSDN

If you are using .NET 4.5 or later, you can use WebUtility.HtmlDecode() instead, which is already included in System.Net, so no reference to System.Web is needed (see: MSDN)
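
As a rough sketch of how this could be applied to the strings in htmlVerses, something like the following should work. The class name VerseDecoder and the method DecodeVerse are hypothetical names made up for illustration, and replacing the embedded line breaks with spaces is an assumption based on the expected output in the question, not something HtmlDecode does by itself.

```csharp
using System;
using System.Net;

// Hypothetical helper, not part of the original question's code.
public static class VerseDecoder
{
    // Converts HTML character entities (&laquo;, &oacute;, ...) back to
    // the characters they stand for, then replaces line breaks with spaces.
    public static string DecodeVerse(string rawText)
    {
        // WebUtility.HtmlDecode lives in System.Net, so no System.Web reference is needed.
        string decoded = WebUtility.HtmlDecode(rawText);

        // Assumption: the line breaks inside a verse are only source formatting.
        return decoded.Replace("\r\n", " ").Replace('\n', ' ');
    }
}
```

Calling VerseDecoder.DecodeVerse on the text after the closing '>' of each entry in htmlVerses should turn "existi&oacute;" into "existió". For anything beyond a quick test, splitting on '<' is fragile, and a real HTML parser would probably be a safer choice.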

Upvotes: 1
