I'm trying to get a better understanding of the relationship between a pipeline and a crawler in Scrapy, after asking my last question (How to pass parameter to a scrapy pipeline object).
One of the answers was:
from sqlalchemy import create_engine, Column, Integer, MetaData, Table, Text

class SQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        settings = crawler.settings
        table = settings.get('table')
        # Instantiate the pipeline with your table
        return cls(table)

    def __init__(self, table):
        _engine = create_engine("sqlite:///data.db")
        _connection = _engine.connect()
        _metadata = MetaData()
        _stack_items = Table(table, _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("detail_url", Text))
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items
I am confused about:
@classmethod
def from_crawler(cls, crawler):
    # Here, you get whatever value was passed through the "table" parameter
    settings = crawler.settings
    table = settings.get('table')
Does the crawler class already exist, or are we creating it here? Can someone explain what is happening here in more detail? I have been reading through a number of sources, including http://scrapy.readthedocs.io/en/latest/topics/api.html#crawler-api and http://scrapy.readthedocs.io/en/latest/topics/architecture.html, but I'm not putting the pieces together yet.
That's me again :)
Maybe what you didn't get is the meaning of classmethod in Python. In your case, it's a method that belongs to your SQLlitePipeline class. Thus, cls is the SQLlitePipeline class itself.
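If it helps, here is a tiny self-contained sketch of the classmethod mechanics outside of Scrapy (the Greeter class below is purely illustrative, not part of the question's code):

class Greeter(object):
    def __init__(self, name):
        self.name = name

    @classmethod
    def from_default(cls):
        # cls is the Greeter class itself, so this is the same as Greeter("world")
        return cls("world")

g = Greeter.from_default()   # called on the class itself, no instance needed
print(g.name)                # world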
Scrapy calls this pipeline method, passing the crawler object, which Scrapy instantiates by itself. At this point we don't have an SQLlitePipeline instance yet. In other words, the pipeline flow hasn't started.
After getting the desired parameter (table) from the crawler's settings, from_crawler finally returns an instance of the pipeline by doing cls(table) (remember what cls is, right? So it's the same as doing SQLlitePipeline(table)).
This is a plain Python object instantiation, so __init__ will be called with the table name it's expecting, and then the pipeline flow will start.
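Note that settings.get('table') can only find the value if a 'table' setting actually exists. A minimal sketch of how it could be supplied (the module path and table name are placeholders, and 'table' is a custom setting, not something Scrapy defines):

# settings.py -- enable the pipeline (the module path is a placeholder)
ITEM_PIPELINES = {
    'myproject.pipelines.SQLlitePipeline': 300,
}

# Then pass the custom 'table' setting per run, for example:
#   scrapy crawl my_spider -s table=stack_items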
EDIT
Maybe it's good to have an overview of the process executed by Scrapy, step by step. Of course it's much more complex than what I will illustrate, but hopefully it'll give you a better understanding.
1) You invoke Scrapy
2) Scrapy instantiates a crawler object:
crawler = Crawler(...)
3) Scrapy identifies the pipeline class you want to use (SQLlitePipeline) and calls its from_crawler method.
# Note that SQLlitePipeline is not instantiated here, as from_crawler is a class method
# However, as we saw before, this method returns an instance of the pipeline class
pipeline_instance = SQLlitePipeline.from_crawler(crawler)
4) From this point on, it calls the pipeline instance methods shown below:
pipeline_instance.open_spider(...)
pipeline_instance.process_item(...)
pipeline_instance.close_spider(...)
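In rough sketch form (this is not the asker's actual code; process_item is the only required hook, and the insert into self.stack_items is just an assumed example using the table built in __init__), those methods look like:

class SQLlitePipeline(object):
    # ... from_crawler and __init__ as shown above ...

    def open_spider(self, spider):
        # Called once when the spider is opened; nothing extra needed here
        pass

    def process_item(self, item, spider):
        # Called for every scraped item; store it in the table created in __init__
        ins = self.stack_items.insert().values(detail_url=item.get('detail_url'))
        self.connection.execute(ins)
        return item

    def close_spider(self, spider):
        # Called once when the spider closes; release the DB connection
        self.connection.close()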