Codious-JR

Reputation: 1748

Scrapy Crawl History

How do I use Scrapy to do "scheduled" crawling? What I mean is, I don't want Scrapy to run continuously; I want it to crawl, let's say, 1K URLs, then take a break and restart.

I am asking for the following two reasons:

1- I don't want Scrapy to put excessive load on the virtual machine if I have multiple crawlers running.

Should I even be worried about Scrapy using too much RAM?

2- If the crawl fails for some reason, how do I restart from where it left off? Does Scrapy do that automatically, or do I have to start from scratch again?

I am especially concerned about the second point.

Upvotes: 1

Views: 434

Answers (1)

Rejected

Reputation: 4491

There's a section in the documentation on this: Jobs: Pausing and Resuming Crawls.
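In short, that page boils down to running the spider with a JOBDIR setting so Scrapy persists its scheduler state to disk. A minimal sketch, assuming a hypothetical spider named somespider crawling a placeholder site:

    import scrapy

    class SomeSpider(scrapy.Spider):
        name = "somespider"  # hypothetical spider name
        start_urls = ["https://example.com"]  # placeholder

        # Persist the request queue and duplicate-filter state to disk
        # so an interrupted crawl can be resumed instead of restarted.
        custom_settings = {
            "JOBDIR": "crawls/somespider-1",
        }

        def parse(self, response):
            yield {"url": response.url}

Pressing Ctrl-C once (or sending SIGTERM) shuts the spider down gracefully and saves its state; running it again with the same JOBDIR resumes the crawl from where it stopped.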

As for memory: as long as you're not doing something that keeps objects alive (such as storing all of your results in memory), it usually isn't a huge issue. Data passes through and is discarded (with some exceptions).

By default, Scrapy does not save its state as it's crawling; see the link above for details on how to enable that.
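For the "crawl 1K URLs, then take a break" part of your question, the built-in CloseSpider extension can stop a crawl after a page count; combined with JOBDIR, each run picks up where the previous one stopped. A sketch of the relevant settings (the 1000 threshold and the job directory name are just examples):

    # settings.py (or per-spider custom_settings)
    JOBDIR = "crawls/somespider-1"    # persist crawl state between runs
    CLOSESPIDER_PAGECOUNT = 1000      # close the spider gracefully after ~1000 responses

The equivalent one-off command would be: scrapy crawl somespider -s JOBDIR=crawls/somespider-1 -s CLOSESPIDER_PAGECOUNT=1000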

Upvotes: 1
