user3236034

Reputation: 463

What is the correct way to use start_requests with Scrapy to scrape a website that uses cookies?

I have a problem scraping a website that uses cookies. I'm using Scrapy, but I can't obtain the data correctly.

I need to specify the website's cookies, because when I log in from the browser it asks me to select a city before showing the relevant information. I have tried some possible solutions without success.

from scrapy import Request, Selector, Spider


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
    ]

    def parse(self, response):
        # The request has to be returned (or yielded), otherwise Scrapy never schedules it
        request_with_cookies = Request(url="http://www.example.com",
                                       cookies={'currency': 'USD', 'country': 'UY'},
                                       callback=self.parse_page2)
        return request_with_cookies

    def parse_page2(self, response):
        sel = Selector(response)
        print(sel)

I have no idea where to put these functions, or, for example, how to use the start_requests method:

from scrapy import Request, Spider


class MySpider(Spider):
    name = "spider"
    allowed_domains = ["example.com"]  # must be a list of domains, not a string

    def start_requests(self):
        return [Request(url="http://www.example.com",
                        cookies={'currency': 'USD', 'country': 'UY'})]

I'm doing it this way, but I'm not sure it's the right way. How should I handle the start_requests method correctly? How should I handle the request with cookies correctly? What is the correct way to specify cookies for a URL? Should I keep this

name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = ["http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"]

in the class when I use start_requests or request_with_cookies?
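
For reference, a minimal sketch of how the pieces could fit together (assuming a recent Scrapy version; the spider name, domain, and parsing logic below are placeholders, not a tested solution for any particular site). When a spider defines start_requests, Scrapy does not use start_urls at all, so the cookies can be attached to the initial request there:

from scrapy import Request, Spider


class CitySpider(Spider):
    # Hypothetical spider name and domain, for illustration only
    name = "city_spider"
    allowed_domains = ["example.com"]

    def start_requests(self):
        # With start_requests defined, start_urls is unnecessary:
        # Scrapy calls this method to generate the initial requests.
        yield Request(url="http://www.example.com",
                      cookies={'currency': 'USD', 'country': 'UY'},
                      callback=self.parse)

    def parse(self, response):
        # Cookies set on the first request are tracked by Scrapy's
        # cookie middleware and reused on follow-up requests.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_page)

    def parse_page(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}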

Upvotes: 1

Views: 343

Answers (1)

anana

Reputation: 1501

Try to set the headers parameter in the request as well (cookies are headers too), like so:

Request(..., headers={'Cookie': 'currency=USD; country=UY'}, ...)

You can also try to activate the dont_merge_cookies option in the meta parameter of Request:

Request(..., meta = {'dont_merge_cookies' : True}, ...)

This tells the crawler to ignore other cookies set by the site and only use these, in case yours would otherwise be overridden by merging.

I think it depends on the site's behaviour which of these will work, so try them in turn and see if they solve the problem.
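
For example, a minimal sketch combining both suggestions in a single request (the URL is the placeholder from the question, and the spider around it is hypothetical):

from scrapy import Request, Spider


class CitySpider(Spider):
    # Hypothetical spider, for illustration only
    name = "city_spider"

    def start_requests(self):
        # Explicit Cookie header plus dont_merge_cookies: the cookie
        # middleware will not merge in cookies the site sends back.
        yield Request(
            url="http://www.example.com",
            headers={'Cookie': 'currency=USD; country=UY'},
            meta={'dont_merge_cookies': True},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Got %s", response.url)

With dont_merge_cookies enabled, the Cookie header sent is exactly the one specified, at the cost of losing any session cookies the site sets along the way.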

Upvotes: 2
