Reputation: 36307
I am using Scrapy 1.1 to scrape a website. The site requires periodic re-login; I can tell when this is needed because a 302 redirection occurs whenever login is required. Based on http://sangaline.com/post/advanced-web-scraping-tutorial/, I have subclassed the RedirectMiddleware, making the Location HTTP header available in the spider under:
request.meta['redirect_urls']
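For reference, the subclass is along these lines (a simplified sketch only; the class name and the exact bookkeeping are illustrative, the real code follows the tutorial above):
from scrapy.downloadermiddlewares.redirect import RedirectMiddleware

class LoginAwareRedirectMiddleware(RedirectMiddleware):
    # Simplified sketch: record the redirect target so spider callbacks
    # can detect that a re-login is needed.
    def process_response(self, request, response, spider):
        if response.status in (301, 302) and 'Location' in response.headers:
            request.meta.setdefault('redirect_urls', []).append(
                response.headers['Location'])
        return super(LoginAwareRedirectMiddleware, self).process_response(
            request, response, spider)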
My problem is that after logging in, I have set up a function to loop through 100 pages to scrape. Let's say after 15 pages I see that I have to log back in (based on the contents of request.meta['redirect_urls']). My code looks like:
def test1(self, response):
    ...
    for row in empties:  # 100 records
        d = object_as_dict(row)
        # AA
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': d}, dont_filter=True)
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        print str(response.meta['redirect_urls'])
        # BB
    d = response.meta['d']
So as you can see, I get 'notified' of the need to re-login in parse_lookup at BB, but need to feed this information back to cancel the loop creating requests in test1 (at AA). How can I make the information in parse_lookup available in the prior callback function?
Upvotes: 8
Views: 676
Reputation: 852
Why not use a DownloaderMiddleware?
You could write a DownloaderMiddleware like so:
Edit: I have edited the original code to address a second problem the OP had in the comments.
from scrapy.http import Request


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            # assuming your spider has a method for handling the login
            original_url = response.meta["redirect_urls"][0]
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url})
        return response
So you "intercept" the response before it goes to the parse_lookup and relogin/fix what is wrong and yield new requests...
As Tomáš Linhart said, the requests are asynchronous, so I don't know whether you could run into problems by re-logging in several times in a row, as multiple requests might be redirected at the same time.
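One way to mitigate that could be to keep a flag on the spider so only the first detected redirect triggers a login request. This is only a sketch, not tested code: relogin_pending is a made-up attribute, and it assumes your spider.login callback resets it once the login succeeds:
from scrapy.http import Request


class CustomMiddleware(object):

    def process_response(self, request, response, spider):
        if 'redirect_urls' in response.meta:
            original_url = response.meta['redirect_urls'][0]
            if getattr(spider, 'relogin_pending', False):
                # A login request is already in flight: re-queue the original
                # request instead of logging in a second time. Strip the stale
                # redirect bookkeeping so the retried response isn't treated
                # as another redirect.
                meta = dict(request.meta)
                meta.pop('redirect_urls', None)
                return request.replace(url=original_url, meta=meta,
                                       dont_filter=True)
            spider.relogin_pending = True  # spider.login should reset this
            return Request(url="login_url",
                           callback=spider.login,
                           meta={"original_url": original_url},
                           dont_filter=True)
        return response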
Remember to add the middleware to your settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 542,
    'myproject.middlewares.CustomMiddleware': 543,
}
Upvotes: 5
Reputation: 1548
I think it would be better not to fire all 100 requests at once; instead, try to "serialize" the requests. For example, you could carry all your empties in the request's meta and pop them off as necessary, or keep the empties as a field of your spider.
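A rough illustration of the meta approach (a sketch only; empties, object_as_dict, myurl and self.headers are taken from the question, and the 'pending' key is just an illustrative name):
def test1(self, response):
    pending = [object_as_dict(row) for row in empties]  # 100 records
    d = pending.pop(0)
    yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                  meta={'d': d, 'pending': pending}, dont_filter=True)

def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # A re-login is needed; handle it here (or let the middleware do it)
        # before carrying on.
        ...
    d = response.meta['d']
    # ... scrape the current record using d ...
    pending = response.meta['pending']
    if pending:
        next_d = pending.pop(0)
        yield Request(url=myurl, headers=self.headers, callback=self.parse_lookup,
                      meta={'d': next_d, 'pending': pending}, dont_filter=True)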
Another alternative would be to use the scrapy-inline-requests package to accomplish what you want, but you should probably extend your middleware to perform the login.
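If you go the scrapy-inline-requests route, usage is roughly as below (a sketch; check the package's README for the exact API, and note that empties, object_as_dict, myurl and login_url are placeholders from the question):
from inline_requests import inline_requests
from scrapy import Request, Spider


class MySpider(Spider):
    name = 'my_spider'

    @inline_requests
    def parse(self, response):
        for row in empties:
            d = object_as_dict(row)
            lookup_response = yield Request(myurl, headers=self.headers,
                                            dont_filter=True)
            if 'redirect_urls' in lookup_response.meta:
                # We got bounced to the login page: log back in (e.g. via a
                # FormRequest), then retry the same record before moving on.
                yield Request(login_url, dont_filter=True)
                lookup_response = yield Request(myurl, headers=self.headers,
                                                dont_filter=True)
            # ... extract what you need from lookup_response using d ...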
Upvotes: 1
Reputation: 1364
Don't iterate over the 100 items and create requests for all of them. Instead, create a request for the first item only, process it in your callback function, yield the item, and only once that's done create the request for the second item and yield it. With this approach, you can check for the Location header in your callback and either make the request for the next item, or log in and repeat the current item's request.
For example:
def parse_lookup(self, response):
    if 'redirect_urls' in response.meta:
        # It's a redirect
        yield Request(url=your_login_url, callback=self.parse_login_response,
                      meta={'current_item_url': response.request.url})
    else:
        # It's a normal response
        item = YourItem()
        ...  # Extract your item fields from the response
        yield item
        next_item_url = ...  # Extract the next page URL from the response
        yield Request(url=next_item_url, callback=self.parse_lookup)
This assumes that you can get the next item URL from the current item's page; otherwise, just put the list of URLs in the first request's meta dict and pass it along.
Upvotes: 2
Reputation: 10220
You can't achieve what you want because Scrapy uses asynchronous processing.
In theory, you could use the approach partially suggested in the comment by @Paulo Scardine, i.e. raise an exception in parse_lookup. For it to be useful, you would then have to write a spider middleware and handle this exception in its process_spider_exception method to log back in and retry the failed requests.
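Very roughly, that could look like the sketch below. LoginNeeded and the spider's login_request helper are made-up names, and whether process_spider_exception is allowed to return new requests depends on your Scrapy version, so check the docs before relying on this:
# In the spider module:
class LoginNeeded(Exception):
    pass

def parse_lookup(self, response):  # spider callback
    if 'redirect_urls' in response.meta:
        raise LoginNeeded(response.request.url)
    # ... normal scraping ...

# In middlewares.py (enable it under SPIDER_MIDDLEWARES and import
# LoginNeeded from wherever you define it):
class ReloginSpiderMiddleware(object):

    def process_spider_exception(self, response, exception, spider):
        if isinstance(exception, LoginNeeded):
            # Log back in, then retry the request that was redirected.
            return [spider.login_request(retry_url=str(exception))]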
But I think a better and simpler approach would be to do the same once you detect the need to log in, i.e. in parse_lookup. I'm not sure exactly how CONCURRENT_REQUESTS_PER_DOMAIN works, but setting it to 1 might let you process one request at a time, so there should be no failing requests, as you always log back in when you need to.
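For instance, in settings.py (CONCURRENT_REQUESTS is capped too only as a precaution; it may not be strictly necessary if everything comes from one domain):
# settings.py
# Process one request at a time so a re-login can finish before the next
# page request goes out.
CONCURRENT_REQUESTS_PER_DOMAIN = 1
CONCURRENT_REQUESTS = 1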
Upvotes: 2