Reputation: 111
The simple script below returns a bunch of rubbish. It works for most websites, but not William Hill:
var Browser = require("zombie");
var assert = require("assert");

// Load the William Hill football page and dump the HTML once it settles
var browser = new Browser();
browser.visit("http://sports.williamhill.com/bet/en-gb/betting/y/5/et/Football.html", function () {
  browser.wait(function () {
    console.log(browser.html());
  });
});
I run it with node, and this is the output:
S����J����ꪙRUݒ�kf�6���Efr2�Riz�����^��0�X� ��{�^�a�yp��p�����Ή��`��(���S]-��'N�8q�����/���?�ݻ��u;�݇�ׯ�Eiٲ>��-���3�ۗG�Ee�,��mF���MI��Q�۲������ڊ�ZG��O�J�^S�C~g��JO�緹�Oݎ���P����ET�n;v������v���D�tvJn��J�8'��햷r�v:��m��J��Z�nh�]�� ����Z����.{Z��Ӳl�B'�.¶D�~$n�/��u"�z�����Ni��"Nj��\00_I\00\��S��O�E8{"�m;�h��,o��Q�y��;��a[������c��q�D�띊?��/|?:�;��Z!}��/�wے�h�<�������%������A�K=-a��~'
(actual output is much longer)
Does anyone know why this happens, and specifically why it happens on the only site I actually want to scrape?
Thanks
Upvotes: 2
Views: 1421
Reputation: 111
I abandoned this approach a long time ago, but in case anyone is interested, I got a reply from one of the zombie.js devs:
https://github.com/assaf/zombie/issues/251#issuecomment-5969175
He says: "Zombie will now send accept-encoding header to indicate it does not support gzip."
In other words, the "rubbish" was the gzip-compressed response body being printed as raw text, because zombie.js was not decompressing it.
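For anyone stuck on a zombie version from before that fix, here is a rough workaround sketch that bypasses zombie entirely: fetch the page with Node's core http module and decompress it with zlib (the host and path are the ones from my question; everything else is just illustrative):

var http = require("http");
var zlib = require("zlib");

var options = {
  host: "sports.williamhill.com",
  path: "/bet/en-gb/betting/y/5/et/Football.html",
  // Ask for gzip explicitly so we know how to decode the reply
  headers: { "accept-encoding": "gzip" }
};

http.get(options, function (res) {
  var chunks = [];
  res.on("data", function (chunk) { chunks.push(chunk); });
  res.on("end", function () {
    var body = Buffer.concat(chunks);
    if (res.headers["content-encoding"] === "gzip") {
      // Decompress the gzipped body before treating it as HTML
      zlib.gunzip(body, function (err, html) {
        if (err) throw err;
        console.log(html.toString("utf8"));
      });
    } else {
      console.log(body.toString("utf8"));
    }
  });
});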
Thank you all who looked into this.
Upvotes: 1
Reputation: 15849
The same code works for other sites that also use gzip in their replies, so it's not a problem with your code.
My guess is that the site is detecting that you are not running a real browser and is defending against data extraction.
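One way to test that guess (a rough sketch using only Node's core http module; the User-Agent string is just an example value) is to request the page with and without a browser-like User-Agent and compare the responses:

var http = require("http");

// Fetch the page with and without a browser-like User-Agent and compare
// status codes and body sizes to see if the site treats them differently
function fetch(userAgent, label) {
  var options = {
    host: "sports.williamhill.com",
    path: "/bet/en-gb/betting/y/5/et/Football.html",
    headers: userAgent ? { "user-agent": userAgent } : {}
  };
  http.get(options, function (res) {
    var length = 0;
    res.on("data", function (chunk) { length += chunk.length; });
    res.on("end", function () {
      console.log(label + ": status " + res.statusCode + ", " + length + " bytes");
    });
  });
}

fetch("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.162 Safari/535.19", "browser UA");
fetch(null, "no UA");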
Upvotes: 0