Sergey

Reputation: 57

In which file/place should Scrapy process the data?

Scrapy has several points/places where scraped data can be processed: spiders, items and spider middlewares. But I don't understand which one is the right place to do it. I can process scraped data in all of these places. Could you explain the differences between them in detail?

For example: a downloader middleware returns some data to the spider (a number, a short string, a URL, a lot of HTML, a list, etc.). What should I do with it, and where? I understand what to do, but it is not clear where to do it...

Upvotes: 2

Views: 435

Answers (2)

Gallaecio

Reputation: 3857

Spiders are the main point where you define how to extract data, as items. When in doubt, implement your extraction logic in your spider only, and forget about the other Scrapy features.

Item loaders, item pipelines, downloader middlewares, spider middlewares and extensions are used mainly for code sharing in scraping projects that have several spiders.

If you ever find yourself repeating the same code in two or more spiders, and you decide to stop repeating yourself, then you should look into those components and choose which ones to use to simplify your codebase by moving existing, duplicated code into one or more components of these types.

It is generally a better approach than simply using class inheritance between Spider subclasses.

As to how to use each component (minimal sketches of several of these follow the list below):

  • Item loaders are for shared extraction logic (e.g. XPath and CSS selectors, regular expressions), as well as pre- and post-processing of field values.

    For example:

    • If you were writing spiders for websites that use some standard way of tagging the data to extract, like schema.org, you could write extraction logic on an item loader and reuse it across spiders.

    • If you always want to convert the value of an item field to uppercase, you would use an output processor on the item loader class and reuse that item loader across spiders (this is sketched after the list).

  • Item pipelines are for post-processing of whole items (not just the data of a specific item field).

    Common use cases include dropping duplicate items (by keeping track of uniquely-identifying data of every item parsed) or sending items to database servers or other forms of storage (as a flexible alternative to feed exports).

  • Downloader middlewares are used for shared logic regarding the handling of requests and responses.

    Common use cases include detecting and handling anti-bot software, or proxy handling (built-in downloader middlewares); a proxy middleware is sketched after this list.

  • Spider middlewares are used for any other shared logic between spiders. It is the closest thing to a spider base class that there is. It can handle exceptions from spiders, the initial requests, etc. (built-in spider middlewares); an exception-handling middleware is sketched after this list.

  • Extensions are used for more general changes to Scrapy itself. (built-in extensions)
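
A minimal sketch of the uppercase item loader example (the item class, field name and CSS selector are hypothetical; depending on your Scrapy version the processors are imported from scrapy.loader.processors or from the separate itemloaders package):

    import scrapy
    from scrapy.loader import ItemLoader
    from itemloaders.processors import Compose, MapCompose, TakeFirst

    class ProductItem(scrapy.Item):
        name = scrapy.Field()

    class ProductLoader(ItemLoader):
        default_item_class = ProductItem
        default_output_processor = TakeFirst()
        # strip whitespace on input, uppercase the final value on output
        name_in = MapCompose(str.strip)
        name_out = Compose(TakeFirst(), str.upper)

Any spider can then reuse the same extraction and processing logic in its callbacks:

    def parse(self, response):
        loader = ProductLoader(response=response)
        loader.add_css("name", "h1.product-name::text")  # hypothetical selector
        yield loader.load_item()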
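
A sketch of the duplicate-dropping pipeline mentioned above (it assumes a hypothetical "id" field that uniquely identifies each item, and it still has to be enabled in the ITEM_PIPELINES setting):

    from scrapy.exceptions import DropItem

    class DuplicatesPipeline:
        def __init__(self):
            self.seen_ids = set()

        def process_item(self, item, spider):
            # drop any item whose unique id was already seen during this crawl
            if item["id"] in self.seen_ids:
                raise DropItem(f"Duplicate item: {item['id']}")
            self.seen_ids.add(item["id"])
            return item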
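
For the proxy use case, a downloader middleware sketch could look like this (the proxy URL is a placeholder, and the class has to be enabled in the DOWNLOADER_MIDDLEWARES setting):

    class ProxyMiddleware:
        def process_request(self, request, spider):
            # the downloader honours the "proxy" request meta key
            request.meta["proxy"] = "http://proxy.example.com:8080"
            return None  # let the request continue through the chain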
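
And a spider middleware sketch for the exception-handling case (hypothetical; enable it in the SPIDER_MIDDLEWARES setting):

    class ErrorHandlingSpiderMiddleware:
        def process_spider_exception(self, response, exception, spider):
            # called when a spider callback raises an exception
            spider.logger.error("Callback failed for %s: %s", response.url, exception)
            return []  # returning an iterable marks the exception as handled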

Upvotes: 3

Umair Ayub

Reputation: 21271

I will try to explain in order:

The Spider is where you decide which URLs to make requests to.
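
A minimal sketch (the spider name and site are just examples):

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # decide which URLs to request next
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)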

DownloaderMiddleware has a process_request method, which is called before a request to a URL is made, and a process_response method, which is called once the response from that URL is received.
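
For example, a sketch of both hooks (the middleware has to be enabled in the DOWNLOADER_MIDDLEWARES setting):

    class LoggingDownloaderMiddleware:
        def process_request(self, request, spider):
            # runs before the request is sent
            spider.logger.debug("Fetching %s", request.url)
            return None  # continue downloading normally

        def process_response(self, request, response, spider):
            # runs after the response is received
            spider.logger.debug("Got %s for %s", response.status, request.url)
            return response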

The Pipeline is where the data is sent when you yield an item or a dictionary from your Spider.
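
For example, a sketch (the pipeline has to be enabled in the ITEM_PIPELINES setting in settings.py):

    # in the spider callback
    def parse(self, response):
        yield {"title": response.css("title::text").get()}

    # in pipelines.py: every yielded dict/item passes through process_item
    class StripTitlePipeline:
        def process_item(self, item, spider):
            item["title"] = (item.get("title") or "").strip()
            return item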

Upvotes: 0
