Reputation: 207
I have to crawl a website, so I'm using Scrapy to do it, but I need to pass a cookie to bypass the first page (which is a kind of login page where you choose your location).
I heard on the web that you need to do this with a BaseSpider (not a CrawlSpider), but I need to use a CrawlSpider to do my crawling, so what do I need to do?
Use a BaseSpider first, then launch my CrawlSpider? But I don't know whether the cookie would be passed between them, or how to do that. How do you launch a spider from another spider?
How do I handle the cookie? I tried this:
def start_requests(self):
    yield Request(url='http://www.auchandrive.fr/drive/St-Quentin-985/', cookies={'auchanCook': '"985|"'})
But it's not working.
The answer to my question should be here, but the person answering is really evasive and I don't know what to do.
Upvotes: 2
Views: 1009
Reputation: 176
First, you need to enable cookies in your settings.py file:
COOKIES_ENABLED = True
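While debugging, you can also turn on COOKIES_DEBUG, a standard Scrapy setting that makes the cookies middleware log every Cookie and Set-Cookie header it sends and receives, so you can verify the cookie is actually going out:
# settings.py -- optional, useful to check the cookie is really sent
COOKIES_DEBUG = True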
Here is my test spider code for your reference. I tested it and it passed:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy import log


class Stackoverflow23370004Spider(CrawlSpider):
    name = 'auchandrive.fr'
    allowed_domains = ["auchandrive.fr"]
    target_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    def start_requests(self):
        # send the shop-selection cookie with the very first request
        yield Request(self.target_url, cookies={'auchanCook': "985|"}, callback=self.parse_page)

    def parse_page(self, response):
        # if the cookie was accepted, we land on the shop page instead of
        # being redirected back to the shop-selection page
        if 'St-Quentin-985' in response.url:
            self.log("Passed : %r" % response.url, log.DEBUG)
        else:
            self.log("Failed : %r" % response.url, log.DEBUG)
You can run this command to test it and watch the console output:
scrapy crawl auchandrive.fr
Upvotes: 3
Reputation: 20748
I noticed that in your code snippet, you were using cookies={'auchanCook': '"985|"'} instead of cookies={'auchanCook': "985|"}.
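The difference matters because those extra quotes are part of the Python string, so they get sent as part of the cookie value itself:
# '"985|"' contains literal double quotes, so the site receives the
# 6-character value "985|" (quotes included)
cookies={'auchanCook': '"985|"'}
# "985|" sends just the 4-character value 985|
cookies={'auchanCook': "985|"}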
This should get you started:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request


class AuchanDriveSpider(CrawlSpider):
    name = 'auchandrive'
    allowed_domains = ["auchandrive.fr"]

    # pseudo-start_url
    begin_url = "http://www.auchandrive.fr/"

    # start URL used as shop selection
    select_shop_url = "http://www.auchandrive.fr/drive/St-Quentin-985/"

    rules = (
        # follow the department links in the header menu
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="header-menu"]',))),
        # product "vignettes" get a dedicated callback
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "vignette-content")]',)),
             callback='parse_product'),
    )

    def start_requests(self):
        yield Request(self.begin_url, callback=self.select_shop)

    def select_shop(self, response):
        # set the shop cookie; with no explicit callback, the response goes
        # through CrawlSpider's default parse() and the rules take over
        return Request(url=self.select_shop_url, cookies={'auchanCook': "985|"})

    def parse_product(self, response):
        self.log("parse_product: %r" % response.url)
Pagination might be tricky.
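If you do need to follow pager links, one option is an extra Rule for them; the cookies middleware keeps the session, so the auchanCook cookie is still sent on followed requests. The XPath below is only a guess at the pager markup, not taken from the actual page, so adapt it to the real HTML:
# hypothetical pagination rule -- the "pagination" class is an assumption
Rule(SgmlLinkExtractor(restrict_xpaths=('//div[@class="pagination"]',))),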
Upvotes: 2