Reputation: 6684
I'm using Scrapy 0.16.4
I have used this code to change the download delay and user-agent:
DOWNLOAD_DELAY = 2
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.97 Safari/537.22 AlexaToolbar/alxg-3.1"
I'm not sure whether this is working, because I still can't crawl all the pages from that site. Each run gives me a different number of scraped items: sometimes 13, sometimes 30, and sometimes 52.
What could be the issue?
Upvotes: 0
Views: 5309
Reputation: 4171
Some websites limit the number of requests per IP address. It is quite possible that they count requests separately for each user agent (Chrome, Firefox, IE, Safari, etc.) rather than adding them together, so you could try rotating through a pool of user agents to reduce the number of requests attributed to any single one.
Here is a link that explains how to do it: "Using random user agent in Scrapy"
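As a rough illustration of the idea (not necessarily what the linked post does), a downloader middleware can set a randomly chosen User-Agent header on each request. The class name, the `USER_AGENT_POOL` list, and the `myproject.middlewares` path below are made up for this sketch, and the second user-agent string is just an example value.

```python
import random

# Hypothetical pool; replace with the user-agent strings you want to rotate through.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) "
    "Chrome/25.0.1364.97 Safari/537.22",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0",
]


class RandomUserAgentMiddleware(object):
    """Downloader middleware that picks a user agent at random for each request."""

    def process_request(self, request, spider):
        # Overwrite whatever User-Agent header the request currently carries.
        request.headers["User-Agent"] = random.choice(USER_AGENT_POOL)
        # Returning None lets Scrapy continue handling the request as usual.
        return None


# In settings.py (assuming the class lives in myproject/middlewares.py):
# DOWNLOADER_MIDDLEWARES = {
#     "myproject.middlewares.RandomUserAgentMiddleware": 400,
# }
```

Note that this only varies the header. If the site counts requests per IP regardless of user agent, rotating user agents alone will not help.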
Upvotes: 4
Reputation: 160
Maybe the site is blocking you with a captcha. You can print response.url and check whether you're getting a referer. Also try raising DOWNLOAD_DELAY to 10. You can set it in the spider itself (as the download_delay attribute) and print each URL as it arrives: if the URLs show up roughly 10 seconds apart, the delay is working.
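One way to make that check less dependent on watching the console is to print the gap between successive responses. This is only a sketch: `DelayLogger` is a hypothetical helper, not part of Scrapy, and the commented spider fragment assumes a spider class of your own.

```python
import time


class DelayLogger(object):
    """Prints each URL together with the time elapsed since the previous one."""

    def __init__(self, clock=time.time):
        self._clock = clock  # injectable so the helper can be tested without waiting
        self._last = None

    def log(self, url):
        now = self._clock()
        gap = None if self._last is None else now - self._last
        self._last = now
        if gap is None:
            print("%s (first response)" % url)
        else:
            print("%s (%.1f s since previous response)" % (url, gap))
        return gap


# Hypothetical use inside your own spider:
#
# class MySpider(BaseSpider):
#     name = "myspider"
#     download_delay = 10          # per-spider override of DOWNLOAD_DELAY
#     delay_logger = DelayLogger()
#
#     def parse(self, response):
#         self.delay_logger.log(response.url)
```

Expect the gaps to vary rather than be exactly 10 seconds: with RANDOMIZE_DOWNLOAD_DELAY enabled (the default), Scrapy waits between 0.5 and 1.5 times the configured delay.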
Upvotes: 0