markzzz

Reputation: 48015

Why does UTF-8 fail on this encoding?

I'm trying to download a page encoded in UTF-8. This is my code:

using (WebClient client = new WebClient())
{
    client.Headers.Add("user-agent", Request.UserAgent);

    htmlPage = client.DownloadString(HttpUtility.UrlDecode(resoruce_url));

    var KeysParsed = HttpUtility.ParseQueryString(client.ResponseHeaders["Content-Type"].Replace(" ", "").Replace(";", "&"));
    var charset = ((KeysParsed["charset"] != null) ? KeysParsed["charset"] : "UTF-8");
    Response.Write(client.ResponseHeaders);

    byte[] bytePage = Encoding.GetEncoding(charset).GetBytes(htmlPage);
    using (var reader = new StreamReader(new MemoryStream(bytePage), Encoding.GetEncoding(charset)))
    {
        htmlPage = reader.ReadToEnd();
        Response.Write(htmlPage);
    }
}

So it uses UTF-8 as the encoding. But the downloaded title, for example, shows on my screen as:

Sexy cover: 60 e più di “quei dischi” vietati ai minori

and not as:

Sexy cover: 60 e più di “quei dischi” vietati ai minori

Something is wrong, but I can't find where. Any ideas?

Upvotes: 0

Views: 1458

Answers (1)

Jim Mischel

Reputation: 134105

The problem is that by the time you get the data it's already been converted.

When WebClient.DownloadString executes, it gets the raw bytes and converts them to a string using the default encoding. The damage is done. You can't take the resulting string, turn it back into bytes, and re-interpret it.

Put another way, this is what's happening:

// WebClient.DownloadString does, essentially, this.
byte[] rawBytes = DownloadData();
string htmlPage = Encoding.Default.GetString(rawBytes);

// Now you're doing this:
byte[] myBytes = Encoding.UTF8.GetBytes(htmlPage);

But myBytes will not necessarily be the same as rawBytes.
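
To see the mismatch concretely, here is a minimal sketch. It assumes the wrong decoder is ISO-8859-1, standing in for whatever Encoding.Default happens to be on your machine; the class and method names are made up for illustration, and the sample word is taken from the title in the question.

```csharp
using System;
using System.Text;

static class RoundTripDemo
{
    // Hypothetical helper: decode with the wrong encoding, then re-encode as UTF-8,
    // which mirrors what the code in the question ends up doing.
    public static byte[] ReEncodeAfterWrongDecode(byte[] rawBytes)
    {
        // Assumption: ISO-8859-1 stands in for Encoding.Default.
        Encoding wrong = Encoding.GetEncoding("iso-8859-1");
        string htmlPage = wrong.GetString(rawBytes); // one char per byte
        return Encoding.UTF8.GetBytes(htmlPage);
    }

    static void Main()
    {
        // "più" as the server sends it in UTF-8 (\u00F9 is ù).
        byte[] rawBytes = Encoding.UTF8.GetBytes("pi\u00F9");
        byte[] myBytes = ReEncodeAfterWrongDecode(rawBytes);

        Console.WriteLine(BitConverter.ToString(rawBytes)); // 70-69-C3-B9
        Console.WriteLine(BitConverter.ToString(myBytes));  // 70-69-C3-83-C2-B9
    }
}
```

The two bytes of "ù" were read as two separate characters, and each of those was then encoded again, so the page comes out double-encoded.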

If you know the encoding beforehand, you can set the WebClient instance's Encoding property before calling DownloadString. If you want to decode the response based on the charset given in the Content-Type header, you have to download the raw bytes, determine the encoding, and use it to decode those bytes. For example:

var rawBytes = client.DownloadData(HttpUtility.UrlDecode(resoruce_url));
var KeysParsed = HttpUtility.ParseQueryString(client.ResponseHeaders["Content-Type"].Replace(" ", "").Replace(";", "&"));
var charset = ((KeysParsed["charset"] != null) ? KeysParsed["charset"] : "UTF-8");

var theEncoding = Encoding.GetEncoding(charset);
htmlPage = theEncoding.GetString(rawBytes);
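
As a variation on the snippet above, the charset can also be read with System.Net.Mime.ContentType instead of the ParseQueryString trick. This is only a sketch: BodyDecoder and Decode are hypothetical names, and falling back to UTF-8 when the header is missing, malformed, or names an unknown charset is an assumption carried over from the code above.

```csharp
using System;
using System.Net.Mime;
using System.Text;

static class BodyDecoder
{
    // Hypothetical helper, not part of WebClient: decodes the raw response bytes
    // using the charset from a Content-Type header value.
    public static string Decode(byte[] rawBytes, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8; // assumed fallback, as in the snippet above

        if (!string.IsNullOrEmpty(contentTypeHeader))
        {
            try
            {
                // ContentType parses values such as "text/html; charset=ISO-8859-1".
                string charset = new ContentType(contentTypeHeader).CharSet;
                if (!string.IsNullOrEmpty(charset))
                    encoding = Encoding.GetEncoding(charset);
            }
            catch (FormatException) { }   // header could not be parsed: keep the fallback
            catch (ArgumentException) { } // charset name not recognized: keep the fallback
        }

        return encoding.GetString(rawBytes);
    }
}
```

With that in place, the download becomes:

```csharp
var rawBytes = client.DownloadData(HttpUtility.UrlDecode(resoruce_url));
htmlPage = BodyDecoder.Decode(rawBytes, client.ResponseHeaders["Content-Type"]);
```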

Upvotes: 5
