Reputation: 2899
I'm trying to parse an HTML document using the .NET WebClient, but the characters I'm getting are not correct. I have tried lots of encodings, but I can't find out why I'm getting it wrong:
The URL is http://www.vatican.va/archive/ESL0506/__P2.HTM.
This is my code (you can test it in a console app):
static void Main(string[] args)
{
    WebClient client = new WebClient();
    client.Encoding = Encoding.GetEncoding(28591);
    var htmlCode = client.DownloadString("http://www.vatican.va/archive/ESL0506/__P2.HTM");
    var splittedHtml = htmlCode.Split('<').ToList();
    var htmlVerses = splittedHtml.Where(x => x.StartsWith("p class=MsoNormal align=left")).ToList();
}
Then, in htmlVerses I get strings like:
"p class=MsoNormal align=left style='margin-left:0cm;text-align:left;\ntext-indent:0cm'>3 Entonces Dios dijo: «Que\nexista la luz». Y la luz existió."
Check this part: 3 Entonces Dios dijo: «Que\nexista la luz». Y la luz existió
It's not parsed correctly. It should be: 3 Entonces Dios dijo: «Que exista la luz». Y la luz existió.
If we check the page source in Chrome, we get this:
Then I tried to get the source code from http://www.generateit.net/seo-tools/source-viewer/ and I'm getting the same anomaly as in my app.
It's really odd: the page declares charset=iso-8859-1, the same encoding that my WebClient uses.
Any help would be appreciated.
Upvotes: 0
Views: 444
Reputation: 1218
HTML escapes special characters as character entities for transmission, so you need to convert them back. Fortunately, .NET provides methods to automagically do that for you:
HttpUtility.HtmlDecode()
see: MSDN
If you are using .NET 4.5, you can use WebUtility.HtmlDecode() instead, which is already included in System.Net (see: MSDN)
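As a rough illustration of the decoding step, here is a minimal sketch using WebUtility.HtmlDecode. The VerseText.Clean helper and the sample input string are assumptions made up for this example, not taken from the question; the whitespace collapsing is an extra step added because the verses in the question contain "\n" from hard-wrapped source lines.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

static class VerseText
{
    // Hypothetical helper (not from the original post): decodes HTML
    // entities, then collapses every run of whitespace (including the
    // "\n" left by hard-wrapped source lines) into a single space.
    public static string Clean(string rawHtmlText)
    {
        string decoded = WebUtility.HtmlDecode(rawHtmlText);
        return Regex.Replace(decoded, @"\s+", " ").Trim();
    }
}

static class Program
{
    static void Main()
    {
        // Assumed sample input: entity-encoded text as it might appear
        // in the raw page source.
        string raw = "3 Entonces Dios dijo: &laquo;Que\nexista la luz&raquo;. Y la luz existi&oacute;.";
        Console.WriteLine(VerseText.Clean(raw));
    }
}
```

In the question's code, something like htmlVerses.Select(VerseText.Clean) could be applied to the extracted strings, after stripping the leading tag text. Note that HttpUtility.HtmlDecode lives in System.Web, so it needs a reference to that assembly, whereas WebUtility does not.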
Upvotes: 1