shenkwen

Reputation: 3880

How are images blocked from being scraped by file_get_contents or wget, and how can the block be worked around?

My client writes blogs on Sina Blog and she is only comfortable with its editor, so after she submits a post I use a small snippet to scrape the images and text over to her own blog website. Its core is:

$url = 'http://s5.sinaimg.cn/bmiddle/001MEJWgzy7xxRaXmDyd4&690';
$img_data = @file_get_contents($url);   // fetch the remote image (errors suppressed)
file_put_contents('1.jpg', $img_data);  // write it to a local file

As weird as it sounds, this worked very well and saved us both tons of time. But recently the downloaded images are all blank except for a watermark. I guess Sina finally detected our little dirty trick and blocked the images from being scraped. I am curious how the block is done and, more importantly, whether there is any way to work around it. I've also tried wget "http://s5.sinaimg.cn/bmiddle/001MEJWgzy7xxRaXmDyd4&690", and it gets only the same blank image.

Upvotes: -1

Views: 367

Answers (1)

Alex Stepanov

Reputation: 411

Just a suggestion - the easiest (and most likely) way a site would go about detecting a scraper is by looking at the request headers, most commonly "Accept", "Referer" and "User-Agent". You could try copying the values that your "real" browser sends and plugging them into the wget call, like so:
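For example (the header values below are only placeholders; copy the exact strings your own browser sends, e.g. from the Network tab of its dev tools, and note the URL has to be quoted because of the &):

wget --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36" \
     --referer="http://blog.sina.com.cn/" \
     --header="Accept: image/webp,image/*,*/*;q=0.8" \
     -O 1.jpg \
     "http://s5.sinaimg.cn/bmiddle/001MEJWgzy7xxRaXmDyd4&690"

If you'd rather keep the PHP snippet, the same headers can be passed to file_get_contents through a stream context. This is just an untested sketch along the same lines, again with placeholder header values:

// Build an HTTP context that sends browser-like request headers.
// The values are illustrative; substitute the ones your real browser sends.
$url = 'http://s5.sinaimg.cn/bmiddle/001MEJWgzy7xxRaXmDyd4&690';
$context = stream_context_create([
    'http' => [
        'header' =>
            "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0 Safari/537.36\r\n" .
            "Referer: http://blog.sina.com.cn/\r\n" .
            "Accept: image/webp,image/*,*/*;q=0.8\r\n",
    ],
]);
$img_data = file_get_contents($url, false, $context);  // fetch using the custom headers
file_put_contents('1.jpg', $img_data);                 // save the image locally

Whether Sina actually keys the block off the Referer is a guess on my part; if copying the headers changes nothing, the block is probably happening some other way (IP-based, rate limiting, expiring URLs, etc.).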

Hope that helps!

Upvotes: 1
