Dean

Reputation: 23

Can't get the web content with UTF-8

I'm trying to download a page as a string with WebClient. The page contains Japanese characters, but the result shows characters like these: ,�^�p�Ǘ�.

var url= "http://www.itmedia.co.jp/im/articles/0609/14/news117.html";

using (var w = new WebClient())
{
   w.Encoding = Encoding.UTF8;
   var htmlData= w.DownloadString(url);
}

The value of htmlData doesn't show the Japanese characters.

Can you tell me why the text isn't decoded as Japanese even though I set the encoding to UTF-8?

Upvotes: 1

Views: 2004

Answers (3)

Dean

Reputation: 23

I changed the encoding from UTF-8 to shift_jis:

w.Encoding = Encoding.GetEncoding("shift_jis");
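To illustrate why the encoding matters, here is a minimal sketch that needs no network access. It encodes an illustrative Japanese string (my own sample, not taken from the page) as Shift_JIS bytes, then decodes those bytes once as UTF-8 and once as Shift_JIS. The `RegisterProvider` call is an assumption about the runtime: it is needed on .NET Core / .NET 5+ (where Shift_JIS lives in the code pages provider) and can be dropped on the classic .NET Framework.

```csharp
using System;
using System.Text;

class Program
{
    static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where Shift_JIS is only
        // available after registering the code pages provider.
        // On the classic .NET Framework this line can be removed.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding sjis = Encoding.GetEncoding("shift_jis");

        // Illustrative sample text, standing in for the bytes the server sends.
        byte[] raw = sjis.GetBytes("運用管理");

        // Decoding Shift_JIS bytes as UTF-8 produces replacement characters,
        // because most of these byte sequences are not valid UTF-8.
        string wrong = Encoding.UTF8.GetString(raw);

        // Decoding with the encoding the bytes were written in recovers the text.
        string right = sjis.GetString(raw);

        Console.WriteLine(wrong);
        Console.WriteLine(right);
    }
}
```

Setting `w.Encoding` does not convert anything on the server side. It only tells WebClient how to interpret the bytes it receives, so it has to match what the page actually uses.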

Upvotes: 0

John Machin

Reputation: 82934

According to the third line of the page source, it's encoded in Shift_JIS:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="ja" id="masterChannel-enterprise"><head>
<meta http-equiv="content-type" content="text/html;charset=shift_jis">
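If you don't want to hard-code the encoding, one possible approach is to download the raw bytes and look for the `charset=` declaration before decoding. The sketch below is only an illustration: `CharsetSniffer` and `Detect` are hypothetical names of mine, the regular expression is deliberately simple, and the sample bytes are built in memory rather than fetched from the site. The `RegisterProvider` call is again only needed on .NET Core / .NET 5+.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
static class CharsetSniffer
{
    // Looks for "charset=..." in the first bytes of the document and returns
    // the matching Encoding, or the fallback when nothing usable is found.
    public static Encoding Detect(byte[] raw, Encoding fallback)
    {
        int n = Math.Min(raw.Length, 2048);

        // The declaration itself is plain ASCII, so an ASCII decode is enough here.
        string head = Encoding.ASCII.GetString(raw, 0, n);

        Match m = Regex.Match(
            head,
            @"charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);

        if (!m.Success)
        {
            return fallback;
        }

        try
        {
            return Encoding.GetEncoding(m.Groups[1].Value);
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return fallback;
        }
    }
}

class Program
{
    static void Main()
    {
        // Assumption: needed on .NET Core / .NET 5+ only.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Sample document built in memory; stands in for downloaded bytes.
        byte[] raw = Encoding.GetEncoding("shift_jis").GetBytes(
            "<meta http-equiv=\"content-type\" content=\"text/html;charset=shift_jis\">" +
            "<title>運用管理</title>");

        Encoding detected = CharsetSniffer.Detect(raw, Encoding.UTF8);
        Console.WriteLine(detected.WebName);
        Console.WriteLine(detected.GetString(raw));
    }
}
```

With WebClient, the bytes would come from `w.DownloadData(url)` instead of the in-memory sample. A more complete version would probably check the `Content-Type` response header first, since a page can declare its charset there as well.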

Upvotes: 1

Patrick Hofman

Reputation: 156968

If you open the page with Postman, you can see the headers of the response.

The response headers show that the response is compressed with gzip. That is probably causing the scrambled response you see.

WebClient nowadays supports decompressing gzip automatically, but that wasn't always the case. (If I run your code on .NET 4.6.2 on Windows 10, I do get the right results.) You might be targeting an older version of the .NET Framework that doesn't support gzip decompression out of the box. The linked post should solve that.
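For reference, one commonly used way to get decompression with WebClient is to subclass it and turn on `AutomaticDecompression` for the underlying `HttpWebRequest`. This is a sketch under that assumption; I can't say whether it is exactly what the linked post describes. `DecompressingWebClient` is a name I made up, and the encoding line assumes the page is Shift_JIS, as the other answers suggest.

```csharp
using System;
using System.Net;
using System.Text;

// Hypothetical subclass: asks the underlying request to decompress
// gzip/deflate responses before WebClient decodes them to a string.
class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);

        var http = request as HttpWebRequest;
        if (http != null)
        {
            http.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }

        return request;
    }
}

class Program
{
    static void Main()
    {
        var url = "http://www.itmedia.co.jp/im/articles/0609/14/news117.html";

        using (var w = new DecompressingWebClient())
        {
            // Assumption: the page is Shift_JIS. On .NET Core / .NET 5+ this
            // requires registering CodePagesEncodingProvider first.
            w.Encoding = Encoding.GetEncoding("shift_jis");
            var htmlData = w.DownloadString(url);
            Console.WriteLine(htmlData.Length);
        }
    }
}
```

Decompression and character encoding are separate steps: even with gzip handled, the bytes still have to be decoded with the charset the page declares.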

Upvotes: 0
