Gautam Chakraborty

Reputation: 61

How can I schedule a Scrapy spider to crawl again after a certain time?

I want to schedule my spider to run again 1 hour after a crawl finishes. In my code, the spider_closed method is called after the crawl ends. How can I run the spider again from this method? Or is there a built-in Scrapy setting that schedules a spider?

Here is my basic spider code.

import scrapy
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher


class A2iSpider(scrapy.Spider):
    name = "notice"
    allowed_domains = ["prothom-alo.com"]

    # Read the start URLs from a file, one URL per line.
    with open("urls.txt") as f:
        start_urls = [url.strip() for url in f.readlines()]

    def __init__(self):
        super().__init__()
        # Call spider_closed once the crawl has finished.
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def parse(self, response):
        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            print("*" * 70)
            print(url)
            print("\n\n")
            yield scrapy.Request(url, callback=self.parse_page,
                                 meta={'depth': 2, 'url': url})

    def parse_page(self, response):
        filename = "response.txt"
        depth = response.meta['depth']

        # Append the depth and URL of every visited page to a log file.
        with open(filename, 'a') as f:
            f.write(str(depth))
            f.write("\n")
            f.write(response.meta['url'])
            f.write("\n")

        for href in response.css("a::attr('href')"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_page,
                                 meta={'depth': depth + 1, 'url': url})

    def spider_closed(self, spider):
        print("$" * 2000)

Upvotes: 4

Views: 2335

Answers (2)

freezix

Reputation: 83

You can run your spider with the JOBDIR setting; it persists the scheduler's pending requests on disk, so an interrupted crawl can be resumed later with the same command:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

https://doc.scrapy.org/en/latest/topics/jobs.html
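
For completeness, the same JOBDIR persistence can be driven from a script instead of the shell. This is only a minimal sketch, assuming it is run from inside the Scrapy project so that get_project_settings can resolve the spider name "notice" from the question:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('JOBDIR', 'crawls/somespider-1')  # scheduler state is persisted here

process = CrawlerProcess(settings)
process.crawl('notice')  # spider name from the question
process.start()          # blocks until the crawl finishes or is interrupted

Interrupting this script once with Ctrl-C shuts the crawl down gracefully, and re-running it picks up where the previous run left off, exactly like the command-line version.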

Upvotes: 1

Harrison

Reputation: 2670

You can use cron.

Run crontab -e to create a schedule that runs as root, or crontab -u [user] -e to run it as a specific user.

At the bottom of the file you can add 0 * * * * cd /path/to/your/scrapy && scrapy crawl [yourScrapy] >> /path/to/log/scrapy_log.log

0 * * * * makes the command run hourly, at minute 0 of every hour; the five fields control minute, hour, day of month, month, and day of week, so other schedules are easy to express.
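
As an alternative when cron is unavailable, the hourly re-run the question asks for can be approximated with one long-running script. This is just a sketch, not part of either answer: it relaunches scrapy crawl as a subprocess, because the Twisted reactor cannot be restarted within a single Python process, and it assumes Python 3.5+ run from the project directory:

import subprocess
import time

while True:
    # Run the spider and wait for the crawl to finish.
    subprocess.run(['scrapy', 'crawl', 'notice'])
    # Sleep one hour before starting the next crawl.
    time.sleep(3600)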

Upvotes: 2
