Justanotherpath
Justanotherpath

Reputation: 156

Can't download utf-8 web content

I have simple code for getting response from a vietnamese website: http://vnexpress.net , but there is a small problem. For the first time, it downloads ok, but after that, the content contains unknown symbols like this:�\b\0\0\0\0\0\0�\a`I�%&/m.... What is the problem?

    string address = "http://vnexpress.net";
    WebClient webClient = new WebClient();
    webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
    webClient.Encoding = System.Text.Encoding.UTF8;
    return webClient.DownloadString(address);

Upvotes: 3

Views: 2997

Answers (3)

Jim Mischel
Jim Mischel

Reputation: 133975

You'll find that the response is GZipped. There doesn't appear to be a way to download that with WebClient, unless you create a derived class and modify the underlying HttpWebRequest to allow automatic decompression.

Here's how you'd do that:

    public class MyWebClient : WebClient
    {
        protected override WebRequest GetWebRequest(Uri address)
        {
            var req = base.GetWebRequest(address) as HttpWebRequest;
            req.AutomaticDecompression = DecompressionMethods.GZip;
            return req;
        }
    }

And to use it:

string address = "http://vnexpress.net";
MyWebClient webClient = new MyWebClient();
webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");
webClient.Encoding = System.Text.Encoding.UTF8;
return webClient.DownloadString(address);

Upvotes: 9

EricLaw
EricLaw

Reputation: 57075

DownloadString requires that the server correctly indicate the charset in the Content-Type response header. If you watch in Fiddler, you'll see that the server instead sends the charset inside a META Tag in the HTML response body:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />   

If you need to handle responses like this, you need to either parse the HTML yourself or use a library like FiddlerCore to do this for you.

Upvotes: 0

Rafik Bari
Rafik Bari

Reputation: 5037

try with code and you'll be fine:

string address = "http://vnexpress.net";

WebClient webClient = new WebClient();

webClient.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.2; WOW64)   AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11 AlexaToolbar/alxg-3.1");

return Encoding.UTF8.GetString(Encoding.Default.GetBytes(webClient.DownloadString(address)));             

Upvotes: 1

Related Questions