Reputation: 859
I'm trying to get the HTML of this page:
https://ec.europa.eu/esco/portal/skill?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Fskill%2F00735755-adc6-4ea0-b034-b8caff339c9f&conceptLanguage=en&full=true
but for some reason the output I receive looks like this:
\0\0\0\0\0\0\u0003�T���0\u0010�#�\u000f�\aNM�.+�b�\"v�\u0010�\u0015+��\u001b����[�\u000e���\u001e�\v���
Here's the code:
using (WebClient client = new WebClient())
{
    client.Headers.Add("Host", "ec.europa.eu");
    client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0");
    client.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    client.Headers.Add("Accept-Language", "pl,en-US;q=0.7,en;q=0.3");
    client.Headers.Add("Accept-Encoding", "gzip, deflate, br");
    client.Headers.Add("DNT", "1");
    client.Headers.Add("Cookie", "JSESSIONID=-(...); escoLanguage=en");
    var output = client.DownloadString(new Uri("https://ec.europa.eu/esco/portal/skill?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Fskill%2F00735755-adc6-4ea0-b034-b8caff339c9f&conceptLanguage=en&full=true"));
}
Does anybody have an idea what's causing this?
I also tried the HTML Agility Pack:
var url = urls.First();
var web = new HtmlWeb();
var doc = web.Load(url);
but doc.Text is null.
Upvotes: 0
Views: 91
Reputation: 406
// Requires: using System; using System.IO; using System.IO.Compression;
//           using System.Net; using System.Text;
using (WebClient client = new WebClient())
{
    client.Encoding = Encoding.UTF8;
    client.Headers.Add("Host", "ec.europa.eu");
    client.Headers.Add("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0");
    client.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
    client.Headers.Add("Accept-Language", "pl,en-US;q=0.7,en;q=0.3");
    // Advertise only gzip, since that is the only encoding decompressed below.
    client.Headers.Add("Accept-Encoding", "gzip");
    client.Headers.Add("DNT", "1");
    client.Headers.Add("Cookie", "JSESSIONID=-(...); escoLanguage=en");

    // Download the raw (still compressed) bytes instead of a decoded string.
    var downloadStr = client.DownloadData(new Uri("https://ec.europa.eu/esco/portal/skill?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Fskill%2F00735755-adc6-4ea0-b034-b8caff339c9f&conceptLanguage=en&full=true"));

    // Decompress the gzip payload into a memory buffer.
    MemoryStream stream = new MemoryStream();
    using (GZipStream g = new GZipStream(new MemoryStream(downloadStr), CompressionMode.Decompress))
    {
        g.CopyTo(stream);
    }

    // Decode the decompressed bytes as UTF-8 text.
    var output = Encoding.UTF8.GetString(stream.ToArray());
}
The output looks like that because the response is gzip-compressed. The code above downloads the raw bytes and decompresses them with GZipStream before decoding them as UTF-8.
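If it is not certain that the server will actually compress the response, one option is to check for the gzip magic bytes (0x1F 0x8B) before decompressing. The following is only a sketch: `GzipHelper` and `DecodeBody` are hypothetical names, not part of the code above, and it assumes the body is UTF-8 text. `Main` runs a local round trip instead of a network call, so it can be tried offline.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipHelper
{
    // Hypothetical helper: returns the body as text, decompressing it first
    // only when it starts with the gzip magic bytes 0x1F 0x8B.
    public static string DecodeBody(byte[] body)
    {
        bool isGzip = body.Length >= 2 && body[0] == 0x1F && body[1] == 0x8B;
        if (!isGzip)
        {
            // Assumption: an uncompressed body is UTF-8 text.
            return Encoding.UTF8.GetString(body);
        }

        using (var input = new GZipStream(new MemoryStream(body), CompressionMode.Decompress))
        using (var output = new MemoryStream())
        {
            input.CopyTo(output);
            return Encoding.UTF8.GetString(output.ToArray());
        }
    }

    static void Main()
    {
        // Build a gzip-compressed sample locally to stand in for a server response.
        byte[] plain = Encoding.UTF8.GetBytes("<html>hello</html>");
        byte[] zipped;
        using (var buffer = new MemoryStream())
        {
            using (var gz = new GZipStream(buffer, CompressionMode.Compress))
            {
                gz.Write(plain, 0, plain.Length);
            }
            zipped = buffer.ToArray();
        }

        Console.WriteLine(DecodeBody(zipped)); // compressed input
        Console.WriteLine(DecodeBody(plain));  // already-plain input
    }
}
```

With `WebClient`, the byte array returned by `client.DownloadData(...)` would be passed to `DecodeBody` in place of the local sample.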
Upvotes: 1
Reputation: 5214
Sending the header "Accept-Encoding: gzip" tells the server that it may return a gzip-compressed body. You then have to decompress the response manually. For example, if you are using a Linux shell:
curl -H "Accept-Encoding: gzip" "$url" --output - | gzip -d
A better solution is to simply remove this header, so that the server sends an uncompressed response.
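Another possibility, not taken from the posts in this thread, is to let .NET decompress the response itself. `WebClient` has no switch for this, but the `HttpWebRequest` it creates has an `AutomaticDecompression` property. A minimal sketch, assuming a subclass named `DecompressingWebClient` (a hypothetical name) and no manually added Accept-Encoding header:

```csharp
using System;
using System.Net;

// Hypothetical subclass: asks the underlying HttpWebRequest to handle
// gzip and deflate responses transparently.
class DecompressingWebClient : WebClient
{
    protected override WebRequest GetWebRequest(Uri address)
    {
        WebRequest request = base.GetWebRequest(address);
        if (request is HttpWebRequest http)
        {
            // The request then sends its own Accept-Encoding header
            // and decompresses the response body before it is read.
            http.AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate;
        }
        return request;
    }
}

class Program
{
    static void Main()
    {
        var url = new Uri("https://ec.europa.eu/esco/portal/skill?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Fskill%2F00735755-adc6-4ea0-b034-b8caff339c9f&conceptLanguage=en&full=true");

        using (var client = new DecompressingWebClient())
        {
            // Do not add "Accept-Encoding" by hand here; in particular "br"
            // is not covered by the two decompression methods set above.
            string html = client.DownloadString(url);
            Console.WriteLine(html.Length);
        }
    }
}
```

This keeps the bandwidth savings of compression while `DownloadString` still returns readable text. Whether the page needs the other headers from the question (cookie, user agent) is not verified here.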
Upvotes: 2
Reputation: 859
Removing this line was the solution for WebClient:
client.Headers.Add("Accept-Encoding", "gzip, deflate, br");
Upvotes: 1