RedVelvet

Reputation: 1903

Info on Scrapy CONCURRENT_REQUESTS

I'm using Scrapy and I read in the docs about the CONCURRENT_REQUESTS setting. The docs describe it as "The maximum number of concurrent (i.e. simultaneous) requests that will be performed by the Scrapy downloader."

I created a spider to scrape questions and answers from Q&A websites, so I want to know whether it is possible to run multiple concurrent requests. For now I have set this value to 1 because I don't want to lose any item or have items overwrite each other.
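
For reference, this is the relevant line in my settings.py (a minimal sketch of my current setup):

# settings.py -- one request at a time for now
CONCURRENT_REQUESTS = 1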

My main doubt is that I keep a global ID, idQuestion (used to build an idQuestion.idAnswer pair), for every item, so I don't know whether making multiple concurrent requests could mix everything up, losing items or assigning wrong IDs.

This is a snippet of code:

import scrapy

class Scraper(scrapy.Spider):
    uid = 1

    def parse_page(self, response):
        # Scraping a single question.
        item = ScrapeItem()  # ScrapeItem is defined in my items.py
        #item['date_time'] = response.meta['data']
        item['type'] = "Question"
        item['uid'] = str(self.uid)
        item['url'] = response.url

        # Do some scraping, then give each answer a composite id.
        ans_uid = 0  # incremented once per answer
        ans_uid += 1
        item['uid'] = str(self.uid) + ":" + str(ans_uid)
        yield item

        # Call the method recursively on the next page.
        # composed_string: URL of the next page (built elsewhere)
        print("NEXT -> " + str(composed_string))
        yield scrapy.Request(composed_string, callback=self.parse_page)

This is the skeleton of my code. I use uid to keep track of the ID of a single question and ans_uid for its answers.

Ex:

1. Question
  1.1) Ans 1 for Question 1
  1.2) Ans 2 for Question 1
  1.3) Ans 3 for Question 1

Can I simply increase the CONCURRENT_REQUESTS value without compromising anything?

Upvotes: 1

Views: 2234

Answers (2)

Xingzhou Liu

Reputation: 1559

Scrapy is not a multithreaded environment; rather, it uses an event-loop-driven asynchronous architecture (Twisted, which is a bit like node.js for Python).

In that sense, it is completely thread-safe.

You actually have a reference to the request object via response.request, which gives you response.request.url, the Referer header that was sent, and response.request.meta, so the mapping from answers back to questions is built in (like a referrer header of sorts). And if you are reading a list of questions or answers from a single page, you are guaranteed that those questions and answers will be read in order.

You can do something like the following:

import scrapy

class Answer(scrapy.Item):
    answer = scrapy.Field()
    question_url = scrapy.Field()

class MySpider(scrapy.Spider):
    name = "my_spider"

    def parse_answer(self, response):
        # The Referer header points back at the question page
        # that issued this request.
        question_url = response.request.headers.get('Referer', None)
        yield Answer(question_url=question_url, answer=...)

Hope that helps.

Upvotes: 0

GHajba

Reputation: 3691

The answer to your question is: no. If you increase the concurrent requests, you can end up with different uid values for the same question, because there is no guarantee that your requests are handled in order.

However, you can pass information along with your Request objects through the meta attribute. I would pass the ID along in meta when yielding the Request, and then check in parse_page whether that attribute is present. If it is not, this is a new question; if it is, reuse that ID because the request belongs to a question you have already seen.

You can read more about meta here: http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request.meta
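
For illustration, here is a minimal sketch of that idea (the spider name, the scraping step, and the next-page URL are assumptions for the example, not the asker's actual code):

import scrapy

class Scraper(scrapy.Spider):
    name = "qa_scraper"  # hypothetical name
    uid = 0

    def parse_page(self, response):
        # Reuse the question ID carried in meta; allocate a new one
        # only when this is the first page of a new question.
        question_id = response.meta.get('question_id')
        if question_id is None:
            self.uid += 1
            question_id = self.uid

        # ... scrape items here, tagging each with question_id ...

        # Pass the same ID along with follow-up requests for this question.
        next_url = response.urljoin('?page=2')  # hypothetical next page
        yield scrapy.Request(next_url, callback=self.parse_page,
                             meta={'question_id': question_id})

Because the ID travels with the request itself, it no longer matters in which order the responses come back.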

Upvotes: 1
