I'm trying to get a better understanding of the relationship between a pipeline and a crawler in Scrapy, after asking my last question (How to pass parameter to a scrapy pipeline object).
One of the answers was:
from sqlalchemy import create_engine, Column, Integer, MetaData, Table, Text

class SQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        settings = crawler.settings
        table = settings.get('table')
        # Instantiate the pipeline with your table
        return cls(table)

    def __init__(self, table):
        _engine = create_engine("sqlite:///data.db")
        _connection = _engine.connect()
        _metadata = MetaData()
        _stack_items = Table(table, _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("detail_url", Text))
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items
I am confused about:
@classmethod
def from_crawler(cls, crawler):
    # Here, you get whatever value was passed through the "table" parameter
    settings = crawler.settings
    table = settings.get('table')
Does the crawler class already exist, or are we creating it here? Can someone explain what is happening here in more detail? I have been reading through a number of sources, including http://scrapy.readthedocs.io/en/latest/topics/api.html#crawler-api and http://scrapy.readthedocs.io/en/latest/topics/architecture.html, but I'm not putting the pieces together yet.
That's me again :)
Maybe what you didn't get is the meaning of classmethod in Python. In your case, it's a method that belongs to your SQLlitePipeline class. Thus, cls is the SQLlitePipeline class itself.
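If it helps, here is a tiny self-contained sketch of the classmethod mechanics outside of Scrapy (the Greeter class below is purely illustrative, not part of the question's code):

class Greeter(object):
    def __init__(self, name):
        self.name = name

    @classmethod
    def from_default(cls):
        # cls is the Greeter class itself, so this is the same as Greeter("world")
        return cls("world")

g = Greeter.from_default()   # called on the class itself, no instance needed
print(g.name)                # world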
Scrapy calls this pipeline method, passing the crawler object, which Scrapy instantiates by itself. At this point we don't have an SQLlitePipeline instance yet. In other words, the pipeline flow hasn't started.
After getting the desired parameter (table) from the crawler's settings, from_crawler finally returns an instance of the pipeline by doing cls(table) (remember what cls is, right? So it's the same as doing SQLlitePipeline(table)).
This is a plain Python object instantiation, so __init__ will be called with the table name it's expecting, and then the pipeline flow will start.
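Note that settings.get('table') can only find the value if a 'table' setting actually exists. A minimal sketch of how it could be supplied (the module path and table name are placeholders, and 'table' is a custom setting, not something Scrapy defines):

# settings.py -- enable the pipeline (the module path is a placeholder)
ITEM_PIPELINES = {
    'myproject.pipelines.SQLlitePipeline': 300,
}

# Then pass the custom 'table' setting per run, for example:
#   scrapy crawl my_spider -s table=stack_items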
EDIT
Maybe it's good to have an overview of the process executed by Scrapy, step by step. Of course it's much more complex than what I will illustrate, but hopefully it'll give you a better understanding.
1) You invoke Scrapy
2) Scrapy instantiates a crawler object:
crawler = Crawler(...)
3) Scrapy identifies the pipeline class you want to use (SQLlitePipeline) and calls its from_crawler method.
# Note that SQLlitePipeline is not instantiated here, as from_crawler is a class method
# However, as we saw before, this method returns an instance of the pipeline class
pipeline_instance = SQLlitePipeline.from_crawler(crawler)
4) From this point on, it calls the pipeline instance methods shown below:
pipeline_instance.open_spider(...)
pipeline_instance.process_item(...)
pipeline_instance.close_spider(...)
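In rough sketch form (this is not the asker's actual code; process_item is the only required hook, and the insert into self.stack_items is just an assumed example using the table built in __init__), those methods look like:

class SQLlitePipeline(object):
    # ... from_crawler and __init__ as shown above ...

    def open_spider(self, spider):
        # Called once when the spider is opened; nothing extra needed here
        pass

    def process_item(self, item, spider):
        # Called for every scraped item; store it in the table created in __init__
        ins = self.stack_items.insert().values(detail_url=item.get('detail_url'))
        self.connection.execute(ins)
        return item

    def close_spider(self, spider):
        # Called once when the spider closes; release the DB connection
        self.connection.close()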