Reputation: 48015
I'm trying to download a page encoded in UTF-8. This is my code:
using (WebClient client = new WebClient())
{
    client.Headers.Add("user-agent", Request.UserAgent);
    htmlPage = client.DownloadString(HttpUtility.UrlDecode(resoruce_url));
    var KeysParsed = HttpUtility.ParseQueryString(client.ResponseHeaders["Content-Type"].Replace(" ", "").Replace(";", "&"));
    var charset = ((KeysParsed["charset"] != null) ? KeysParsed["charset"] : "UTF-8");
    Response.Write(client.ResponseHeaders);
    byte[] bytePage = Encoding.GetEncoding(charset).GetBytes(htmlPage);
    using (var reader = new StreamReader(new MemoryStream(bytePage), Encoding.GetEncoding(charset)))
    {
        htmlPage = reader.ReadToEnd();
        Response.Write(htmlPage);
    }
}
So it sets UTF-8 for the encoding. But the downloaded title, for example, shows on my screen as:
Sexy cover: 60 e più di “quei dischi” vietati ai minori
and not as:
Sexy cover: 60 e più di “quei dischi” vietati ai minori
Something is wrong, but I can't find where. Any ideas?
Upvotes: 0
Views: 1458
Reputation: 134105
The problem is that by the time you get the data it's already been converted.
When WebClient.DownloadString executes, it gets the raw bytes and converts them to a string using the default encoding. The damage is done. You can't take the resulting string, turn it back into bytes, and re-interpret it.
Put another way, this is what's happening:
// WebClient.DownloadString does, essentially, this.
byte[] rawBytes = DownloadData();
string htmlPage = Encoding.Default.GetString(rawBytes);
// Now you're doing this:
byte[] myBytes = Encoding.UTF8.GetBytes(htmlPage);
But myBytes will not necessarily be the same as rawBytes.
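To make the mismatch concrete, here is a minimal sketch of that round trip. The names MojibakeDemo and RoundTrip are made up for illustration, and ISO-8859-1 is only an assumption standing in for whatever the default encoding happens to be on your machine.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Decodes UTF-8 bytes with the wrong encoding, then re-encodes the
    // damaged string as UTF-8, mirroring the round trip described above.
    public static (string Wrong, byte[] Reencoded) RoundTrip(byte[] rawBytes)
    {
        // Assumption: ISO-8859-1 stands in for the machine's default encoding.
        Encoding wrongEncoding = Encoding.GetEncoding("iso-8859-1");

        // Each raw byte becomes one character, so the two-byte UTF-8
        // sequence for U+00F9 turns into two separate characters.
        string wrong = wrongEncoding.GetString(rawBytes);

        // Re-encoding the damaged string does not restore the original bytes.
        byte[] reencoded = Encoding.UTF8.GetBytes(wrong);
        return (wrong, reencoded);
    }
}
```

For the UTF-8 bytes of "più" (70-69-C3-B9), RoundTrip returns the string "piÃ¹" and the bytes 70-69-C3-83-C2-B9, so decoding those bytes as UTF-8 again just gives back the same damaged string.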
If you know what encoding to use beforehand, you can set the WebClient instance's Encoding property before calling DownloadString. If you want to interpret the response based on the charset specified in the Content-Type header, then you have to download the raw bytes, determine the encoding, and use that encoding to decode the bytes into a string. For example:
var rawBytes = client.DownloadData(HttpUtility.UrlDecode(resoruce_url));
var KeysParsed = HttpUtility.ParseQueryString(client.ResponseHeaders["Content-Type"].Replace(" ", "").Replace(";", "&"));
var charset = ((KeysParsed["charset"] != null) ? KeysParsed["charset"] : "UTF-8");
var theEncoding = Encoding.GetEncoding(charset);
htmlPage = theEncoding.GetString(rawBytes);
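If the header might be missing or name a charset that .NET doesn't recognize, the lookup can be wrapped in a small helper. This is only a sketch: ResponseDecoder, EncodingFromContentType and Decode are hypothetical names, the UTF-8 fallback is carried over from the code above, and it uses System.Net.Mime.ContentType to read the charset parameter instead of the query-string trick.

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class ResponseDecoder
{
    // Hypothetical helper: returns the encoding named by a Content-Type
    // header value, falling back to UTF-8 when the header or its charset
    // parameter is missing, cannot be parsed, or names an unknown encoding.
    public static Encoding EncodingFromContentType(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return Encoding.UTF8;

        try
        {
            string charset = new ContentType(contentTypeHeader).CharSet;
            return string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException)
        {
            return Encoding.UTF8; // header could not be parsed
        }
        catch (ArgumentException)
        {
            return Encoding.UTF8; // charset name not recognized
        }
    }

    // Decodes the raw response bytes exactly once, with the chosen encoding.
    public static string Decode(byte[] rawBytes, string contentTypeHeader)
    {
        return EncodingFromContentType(contentTypeHeader).GetString(rawBytes);
    }
}
```

With the variables from the code above, the call would look like htmlPage = ResponseDecoder.Decode(rawBytes, client.ResponseHeaders["Content-Type"]). If you already know the page is UTF-8, setting client.Encoding = Encoding.UTF8 before DownloadString is simpler.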
Upvotes: 5