Reputation: 1205
I'm developing a Scrapy spider that is successfully yielding some items. Those items should be inserted into a database using pymysql. Because the data is relational, I have to execute several insert statements for each item.
I'd like to call connection.commit() only after all statements for an item have been executed, to make sure that errors do not leave inconsistent entries in my database.
I'm currently wondering whether Scrapy calls process_item in parallel for more than one item, or sequentially for one item after another. If the latter is the case, I could simply use the following approach:
def process_item(self, item, spider):
    # execute insert statements
    connection.commit()
If more than one call to process_item is executed at the same time by Scrapy, the commit() at the end could run while another item's statements have not all been executed yet, committing a half-inserted item.
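To make this concrete, here is a rough sketch of the pipeline I have in mind. The table names (article, article_tag), columns, and connection parameters are placeholders, not my real schema:

import pymysql

class MySQLPipeline(object):
    def open_spider(self, spider):
        # Placeholder connection parameters.
        self.connection = pymysql.connect(
            host='localhost',
            user='scrapy',
            password='secret',
            db='scraping',
            charset='utf8mb4',
        )

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        try:
            with self.connection.cursor() as cursor:
                # First insert: the item itself.
                cursor.execute(
                    "INSERT INTO article (title, url) VALUES (%s, %s)",
                    (item['title'], item['url']),
                )
                article_id = cursor.lastrowid
                # Related rows referencing the first insert.
                for tag in item.get('tags', []):
                    cursor.execute(
                        "INSERT INTO article_tag (article_id, tag)"
                        " VALUES (%s, %s)",
                        (article_id, tag),
                    )
            # Commit only once all statements for this item succeeded.
            self.connection.commit()
        except Exception:
            # Undo the partial insertion for this item.
            self.connection.rollback()
            raise
        return item

If process_item is only ever entered for one item at a time, the commit()/rollback() pair above is sufficient; if two calls can interleave, both items would share the same open transaction on this connection, and the first commit() would also commit the other item's partial inserts.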
The documentation for item pipelines states:
After an item has been scraped by a spider, it is sent to the Item Pipeline which processes it through several components that are executed sequentially.
But I'm not quite sure whether that means that process_item will never be executed in parallel, or just that the different pipelines are always executed one after another (for example: Dropping Duplicates -> Changing Something -> DB Insertion).
I think that process_item
will be executed sequentially, as the documentation shows the following example:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
In this code, there is no synchronization involved for adding the id to ids_seen; nevertheless, I don't know whether the example is simplified because it only demonstrates how to use pipelines.
Upvotes: 3
Views: 1681
Reputation: 911
The documentation for the CONCURRENT_ITEMS setting specifies that items are processed in parallel (at least within a single response). I think setting it to 1 might help in your case.
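For example, a minimal sketch: CONCURRENT_ITEMS just goes into your project's settings.py.

# settings.py -- process at most one item at a time per response
CONCURRENT_ITEMS = 1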
I'm no expert on this part of Scrapy, but I believe this is where it happens.
Upvotes: 1