Reputation: 974
This is the website I am crawling. I had no problem at first, but then I encountered this error.
[scrapy] DEBUG: Redirecting (meta refresh) to <GET https://www.propertyguru.com.my/distil_r_captcha.html?requestId=9f8ba25c-3673-40d3-bfe2-6e01460be915&httpReferrer=%2Fproperty-for-rent%2F1> from <GET https://www.propertyguru.com.my/property-for-rent/1>
The website detects that I am a bot and redirects me to a page with a captcha. I think handle_httpstatus_list
or dont_redirect
don't work because the redirection isn't done with HTTP status codes. This is my crawler's code. Is there any way to stop this redirection?
import scrapy
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'myspider'

    start_urls = [
        'https://www.propertyguru.com.my/property-for-rent/1',
    ]

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }

    meta = {
        'dont_redirect': True
    }

    def parse(self, response):
        items = response.css('div.header-container h3.ellipsis a.nav-link::attr(href)').getall()
        if items:
            for item in items:
                if item.startswith('/property-listing/'):
                    yield scrapy.Request(
                        url='https://www.propertyguru.com.my{}'.format(item),
                        method='GET',
                        headers=self.headers,
                        meta=self.meta,
                        callback=self.parse_items,
                    )

    def parse_items(self, response):
        from scrapy.shell import inspect_response
        inspect_response(response, self)
UPDATE: I tried these settings, but they didn't work either.
custom_settings = {
'DOWNLOAD_DELAY': 5,
'DOWNLOAD_TIMEOUT': 360,
'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
'CONCURRENT_ITEMS': 1,
'REDIRECT_MAX_METAREFRESH_DELAY': 200,
'REDIRECT_MAX_TIMES': 40,
}
Upvotes: 5
Views: 2846
Reputation: 2927
To stop meta refresh, simply disable it in your project's settings.py file:
METAREFRESH_ENABLED = False
https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#metarefreshmiddleware-settings
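If you only want to disable it for this one spider rather than the whole project, the same setting can go in the spider's custom_settings; a minimal sketch (the other keys shown in the question would sit alongside it):

```python
# Per-spider override instead of settings.py: disables only the
# MetaRefreshMiddleware behaviour for this spider.
custom_settings = {
    'METAREFRESH_ENABLED': False,
}
```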
Upvotes: 2
Reputation: 1035
This website is protected by Distil Networks, which uses JavaScript to determine that you are a bot. Are they letting some requests through, or none at all? You may have some success with Selenium, but in my experience they will catch on eventually. The solution involves randomizing the entire browser fingerprint: screen size and everything else you can think of. If anybody else has additional info I would be interested to hear it; I'm not sure about Stack Overflow's ToS on stuff like this.
If you load up a proxy like Charles Proxy so you can see everything going on, you can inspect all the JS they are running on you.
If they are letting 0 requests through, I'd advise trying your luck with Selenium.
If they are letting some through and redirecting others, my experience is that over time they will eventually redirect them all. What I would do if they are letting some through is set RETRY_HTTP_CODES.
To expand on this a bit more, I will link to this post about overriding your navigator object with Selenium, which is what contains much of your browser fingerprint. It must be done in JS and on every page load. I can't attest to its effectiveness against Distil. See this answer
The direct answer to your question (thanks to the other answers for completing mine):
# settings.py
RETRY_HTTP_CODES = [404, 303, 304, ???]
RETRY_TIMES = 20
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
}
In the spider, set meta for a particular request:
meta={'dont_redirect': True}
It's also worth noting that in a downloader middleware's process_response method you can catch the 302 and have it issue another request. Combined with RETRY_HTTP_CODES, this is a good way to brute-force your way through if you have a good User-Agent list and IP source.
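A minimal sketch of that retry-in-middleware idea. The class name, the ban codes, and the `ban_retries` meta key are illustrative assumptions, not Scrapy built-ins; the middleware only relies on the standard Request/Response interface (`status`, `meta`, `copy()`, `dont_filter`):

```python
# Sketch: treat certain statuses as a ban and re-schedule the request
# instead of letting the redirect middleware follow it.
class BanRetryMiddleware:
    BAN_CODES = {302}   # statuses treated as a ban (assumption for this site)
    MAX_RETRIES = 3

    def process_response(self, request, response, spider):
        if response.status in self.BAN_CODES:
            retries = request.meta.get('ban_retries', 0)
            if retries < self.MAX_RETRIES:
                # Re-issue the same request; a rotated UA/IP would be
                # picked up on the next download attempt.
                retry = request.copy()
                retry.meta['ban_retries'] = retries + 1
                retry.dont_filter = True  # bypass the duplicate filter
                return retry
        return response  # everything else passes through untouched
```

Register it under DOWNLOADER_MIDDLEWARES like any other middleware; the retry cap keeps a permanently banned URL from looping forever.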
I suggest you try https://scrapinghub.com/crawlera . They recently raised their prices, but they supply good IPs and detect bans. It really is worth it if you need to get at certain information. Their network is smart, unlike most of the much cheaper IP-rotation networks. They have a trial going on so you can verify whether it works, and it's made by the developers of Scrapy, so follow the documentation for an easy install with
pip install scrapy_crawlera
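After installing, enabling it is a settings change; a minimal sketch per the scrapy-crawlera documentation (the API key value is a placeholder from your account):

```python
# settings.py -- scrapy-crawlera setup sketch; the key is a placeholder.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your-api-key>'
```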
Then you can retry all of your requests until the rotator gives you a good IP, though I suspect that over a short period of time they will all be banned.
Upvotes: 5
Reputation: 739
To stop meta refresh, disable the MetaRefreshMiddleware downloader middleware in the project settings by setting its value to None:
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': None,
}
Upvotes: 1