Reputation: 174
I am using a C# WinForms app to scrape some data from a webpage that uses charset ISO-8859-1. It works well for many special characters, but not all.
(* Below I use colons instead of semicolons in the character codes, so that you will see the code that I see and not the rendered character.)
I looked at the Page Source and noticed that for the characters that won't display correctly, the actual code (e.g. &#363:) appears in the Page Source instead of the character itself. For example, in the Page Source I see Ry&#363: Murakami, but I expect to see Ryū Murakami. Many other codes also appear literally, such as &#350: &#333: &#353: &#269: &#259: &#537: and many more.
I have tried using WebClient.DownloadString and WebClient.DownloadData.
Try #1 Code:
using (WebClient wc = new WebClient())
{
    wc.Encoding = Encoding.GetEncoding("ISO-8859-1");
    string WebPageText = wc.DownloadString("http://www.[removed].htm");
    // Scrape WebPageText here
}
Try #2 Code:
Encoding iso = Encoding.GetEncoding("ISO-8859-1");
Encoding utf8 = Encoding.UTF8;

using (WebClient wc = new WebClient())
{
    wc.Encoding = iso;
    byte[] AllData = wc.DownloadData("http://www.[removed].htm");
    byte[] utfBytes = Encoding.Convert(iso, utf8, AllData);
    string WebPageText = utf8.GetString(utfBytes);
    // Scrape WebPageText here
}
I want to keep the special characters, so please don't suggest any RemoveDiacritics examples. Am I missing something?
Upvotes: 0
Views: 999