Reputation: 3
Basically, I'm using file_get_contents() and preg_match() to harvest information from a website.
There's a problem, however.
I can run the program many times initially, but after a while it stops working.
Why? Because the site redirects me to one of their sponsor pages.
I think this is some kind of fail-safe: if a certain IP address accesses their site too frequently within a certain time span (maybe 30-40 times in a few hours, from what I've seen), a flag goes up and that IP gets redirected to the sponsor page.
Then I have to wait a few hours before I can access the actual page again. This is bad, because at some point my program is going to be searching hundreds of pages, which will cause problems.
The site in question is a horse racing site; the page in the code below is just one horse profile page out of thousands.
My question is:
How do I fetch the file contents anonymously, or otherwise get around this restriction, so I can make as many requests as I like? Thanks.
Below is some code you can try for yourself to see what happens.
It is similar to my code, except I deliberately put it in a loop to use up all the "visits" quickly.
The printed output is messy (it will likely just print "not found"), but once you have executed it, visiting the site manually in your browser will redirect you:
function hm() {
    // Fetch the same page repeatedly to use up the site's request allowance.
    for ($x = 0; $x < 50; $x++) {
        $file = file_get_contents("http://www.turf-fr.com/fiche-cheval/MONTELUPO.html", false);
        // preg_match_all() returns the number of matches, not a boolean.
        if (preg_match_all("/MONTELUPO/", $file, $matches, PREG_OFFSET_CAPTURE) > 0) {
            print "Found ";
        } else {
            // The horse's name is missing, so we were redirected; stop looping.
            print "not found ";
            break;
        }
    }
}
hm();
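As an aside, if you want to confirm the redirect programmatically instead of checking in a browser, file_get_contents() populates $http_response_header with the raw response headers, including any redirect hops it followed. A minimal sketch (exactly how the sponsor-page redirect looks is my assumption):

$file = file_get_contents("http://www.turf-fr.com/fiche-cheval/MONTELUPO.html", false);
foreach ($http_response_header as $header) {
    // A Location header in the chain means the request was redirected.
    if (stripos($header, "Location:") === 0) {
        print "Redirected to: " . trim(substr($header, 9)) . "\n";
    }
}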
Upvotes: 0
Views: 122
Reputation: 21
You can't do much about this other than switching between public proxies every 20-30 requests. Since the web server rate-limits by client IP, it cannot be fixed from the client side with code changes alone. A rough sketch of that rotation is below.
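Here is a minimal sketch of what that rotation could look like using file_get_contents() with a stream context. The proxy addresses, the URL list, and the 20-requests-per-proxy threshold are all placeholders you would have to fill in yourself:

// Placeholder proxy addresses; substitute real, working public proxies.
$proxies = [
    "tcp://203.0.113.10:8080",
    "tcp://203.0.113.11:3128",
];

function fetch_via_proxy($url, $proxy) {
    // Route file_get_contents() through an HTTP proxy via a stream context.
    $context = stream_context_create([
        "http" => [
            "proxy"           => $proxy,
            "request_fulluri" => true, // most HTTP proxies expect the full URI
            "timeout"         => 10,
        ],
    ]);
    return file_get_contents($url, false, $context);
}

$urls = []; // ... the horse profile pages you want to scrape ...
$requestsPerProxy = 20; // stay under the site's apparent threshold

foreach ($urls as $i => $url) {
    // Move to the next proxy after every $requestsPerProxy requests.
    $proxy = $proxies[intdiv($i, $requestsPerProxy) % count($proxies)];
    $html  = fetch_via_proxy($url, $proxy);
    // ... run preg_match_all() on $html as before ...
}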
Upvotes: 1