Reputation: 887
I need to scrape this page (the ads): https://www.sahibinden.com/en/cars/used?date=1day&a5_min=2005&a5_max=2020
When I open it too many times I get blocked, and changing the IP doesn't help either. The odd part is that opening this page in a regular browser on my PC works just fine, but it seems to get blocked when opened from WebKit.
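// Override method, body and headers for the first request only;
// every later request passes through unchanged.
await page.route("**/*", (route) => {
  if (!firstReq) {
    route.continue();
  } else {
    firstReq = false;
    route.continue({
      method: method,
      postData: data,
      headers: headers,
    });
  }
});
// goto() already resolves once the navigation has finished, so a
// following waitForNavigation() would wait for a second navigation
// and can time out.
let pageRes = await page.goto(url);
await page.unroute("**/*");
return pageRes;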
I realize that the site is trying to block bots, but what are the usual practices for avoiding that? I have tried waits, IP rotation, and user-agent rotation; nothing seems to work.
Upvotes: 1
Views: 531
Reputation: 8861
Their Terms of Use (§4.11) state that scraping their content is not allowed:
The use of the whole or any part of the "Portal" for [...] Automatic program on the site, robot, spider, web crawler , spider, data mining, data crawling etc. "screen scraping" software or systems, using automated tools or manual processes, [...] such uses will be prevented at the discretion of the OWNER. [...]
So you can be sure they are doing their best to prevent scraping.
There are ways to work around the blocks; I advise you to read Thomas Dondorf's great answer on the topic of headless browsers and blocking by reCAPTCHA. I also strongly recommend considering his first option in this case:
Option 1: Stop crawling or try to use an official API. As the owner of the page does not want you to crawl that page, you could simply respect that decision and stop crawling. Maybe there is a documented API that you can use.
In general, scraper detection can differ hugely depending on whether you visit the site in headless or headful mode, and on whether you use the slowMo option of launch().
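As a rough illustration of the two settings mentioned above, here is a minimal sketch. It assumes Playwright (which the `page.route` calls in the question suggest) and its `headless` and `slowMo` launch options; `buildLaunchOptions`, `openPage`, and the 250 ms default are hypothetical names and values chosen for the example, not anything the site or the question prescribes.

```javascript
// Hypothetical helper: turn two readable flags into launch options.
// `headless` and `slowMo` are the launch() options discussed above;
// the 250 ms default is an arbitrary assumption.
function buildLaunchOptions({ headful = true, slowMoMs = 250 } = {}) {
  return {
    headless: !headful, // headful mode shows a real browser window
    slowMo: slowMoMs,   // delay each browser operation by this many ms
  };
}

// Hypothetical helper: open one page with the given launch options.
// Playwright is required lazily so the options helper can be used
// (and tested) without the package installed.
async function openPage(url, options = buildLaunchOptions()) {
  const { webkit } = require("playwright");
  const browser = await webkit.launch(options);
  const page = await browser.newPage();
  const response = await page.goto(url);
  return { browser, page, response };
}

// Usage (not executed here):
// const { browser, response } = await openPage("https://example.com");
// console.log(response.status());
// await browser.close();
```

Running the same URL once with `buildLaunchOptions({ headful: false, slowMoMs: 0 })` and once with the defaults is a simple way to check whether the mode itself changes how the site responds. It does not change what the Terms of Use allow.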
Upvotes: 1