Rohan Nikam

Reputation: 11

CrawlerRunner() not going through pipeline file of scrapy

I am trying to call a Scrapy spider from a Django views.py file. The spider does get invoked, but its output is shown in the command prompt and is not saved to the Django models so it can be rendered on the page. I checked by running the spider separately to verify that Scrapy and Django are connected, and it works correctly; but when automated with the CrawlerRunner() script it doesn't. So some component is missing in the CrawlerRunner() implementation in the Django views.py file. Below is the Django views.py file which calls the spider:

@csrf_exempt
@require_http_methods(['POST', 'GET'])
def scrape(request):
    import sys
    from newscrawler.spiders import news_spider
    from newscrawler.pipelines import NewscrawlerPipeline
    from scrapy import signals
    from twisted.internet import reactor
    from scrapy.crawler import Crawler, CrawlerRunner
    from scrapy.settings import Settings
    from scrapy.utils.project import get_project_settings
    from scrapy.utils.log import configure_logging
    from crochet import setup

    setup()
    configure_logging()

    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(news_spider.NewsSpider)

    return redirect("../getnews/")

My spider crawls a news website and saves the URL, image, and title of the top news. But instead of saving these three fields through the models.py file, the output is just printed in cmd. Can anyone help?

Items file from Scrapy (items.py):

import scrapy
from scrapy_djangoitem import DjangoItem

import sys

import os
os.environ['DJANGO_SETTINGS_MODULE'] = 'News_Aggregator.settings'

from news.models import Headline

class NewscrawlerItem(DjangoItem):
    # DjangoItem derives the item's fields from the linked Django model
    django_model = Headline
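
For reference, the Headline model is not shown in the question. Based on the fields mentioned (URL, image, title), it might look roughly like the sketch below; the field names and types are assumptions, not the actual model from the project.

from django.db import models

class Headline(models.Model):
    # assumed fields matching the scraped data: title, image, url
    title = models.CharField(max_length=200)
    image = models.URLField(blank=True, null=True)
    url = models.TextField()

    def __str__(self):
        return self.title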

Pipelines file (pipelines.py):

class NewscrawlerPipeline(object):
    def process_item(self, item, spider):
        print("10000000000000000")  # debug marker to confirm the pipeline is reached
        item.save()
        return item
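
Note that this pipeline only runs if it is enabled in the Scrapy project's settings module. The relevant entry in newscrawler/settings.py would look something like this (the priority value 300 is just a typical example):

# newscrawler/settings.py (relevant excerpt)
ITEM_PIPELINES = {
    'newscrawler.pipelines.NewscrawlerPipeline': 300,
}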

Upvotes: 0

Views: 1242

Answers (1)

Rohan Nikam

Reputation: 11

I figured it out: CrawlerRunner was not able to access the settings file of my Scrapy project, which is what enables Scrapy's pipelines.py, which in turn saves the data to the Django models. The modified code of the Django views.py file which calls the spider is:

import os
import sys
import time
from django.shortcuts import redirect
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_http_methods
from news.models import Headline
from newscrawler.spiders import news_spider
from newscrawler.pipelines import NewscrawlerPipeline
from scrapy import signals
from twisted.internet import reactor
from scrapy.crawler import Crawler, CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.project import get_project_settings
from newscrawler import settings as my_settings
from scrapy.utils.log import configure_logging
from crochet import setup

@csrf_exempt
@require_http_methods(['POST', 'GET'])
def scrape(request):
    Headline.objects.all().delete()  # clear previously scraped headlines
    crawler_settings = Settings()

    setup()
    configure_logging()
    # load the Scrapy project's settings explicitly so ITEM_PIPELINES is applied
    crawler_settings.setmodule(my_settings)
    runner = CrawlerRunner(settings=crawler_settings)
    d = runner.crawl(news_spider.NewsSpider)
    time.sleep(8)  # crude wait to let the crawl finish before redirecting
    return redirect("../getnews/")
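
As a possible refinement (a sketch under the same assumptions, not tested against this project): instead of the fixed time.sleep(8), crochet's wait_for decorator can block the view until the crawl's Deferred actually fires, with a timeout:

from crochet import setup, wait_for

setup()

@wait_for(timeout=60.0)  # blocks the calling thread until the crawl finishes or times out
def run_crawl():
    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)
    runner = CrawlerRunner(settings=crawler_settings)
    return runner.crawl(news_spider.NewsSpider)  # returns a Deferred; wait_for waits on it

The view would then simply call run_crawl() and redirect once it returns.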

Hope this helps anyone wanting to call a Scrapy spider from within the Django views.py file and save the scraped data to Django models. Thank you.

Upvotes: 1
