mrm

Reputation: 27

Scrapy is crawling, but no output

I am scraping a few pages without errors, but the crawler is not generating any output. The function parse_article works fine (I tested it separately), but once it is called from the parse function, no output is created anymore. Any ideas?

I was running the crawler via command line: scrapy crawl all_articles_from_one_page -o test_file.csv

import scrapy
from scrapping_538.items import Scrapping538Item
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
import datetime
import socket


class BasicSpider(scrapy.Spider):
    name = 'all_articles_from_one_page'
    allowed_domains = ['web']
    start_urls = ('http://fivethirtyeight.com/features/',)

    def parse(self, response):
        # iterate through articles
        article_divs = response.xpath('//*[@id="primary"]//div[contains(@id, "post")]')
        for article in article_divs:
            print('\n**********************************************')
            article_link = article.xpath('.//h2/a/@href').extract()[0] 
            print('------article link: ' + str(article_link))
            yield scrapy.Request(article_link, callback=self.parse_article)

    def parse_article(self, response):
        il = ItemLoader(item=Scrapping538Item(), response=response)
        il.add_css('title', 'h1.article-title::text')
        il.add_css('date', 'time.datetime::text')
        il.add_css('author', '.author::text')
        il.add_css('filed_under', '.term::text')
        il.add_css('article_text', '.entry-content *::text')

        il.add_value('url', response.url)
        il.add_value('project', self.settings.get('BOT_NAME'))
        il.add_value('spider', self.name)
        il.add_value('server', socket.gethostname())
        il.add_value('date_import', datetime.datetime.now())

        return il.load_item()

Upvotes: 2

Views: 957

Answers (1)

Thiago Curvelo

Reputation: 3740

Change your allowed_domains to:

allowed_domains = ['fivethirtyeight.com']

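Scrapy filters out any request to a domain that is not listed in that attribute. With 'web' as the only entry, the start URL is still fetched, but the article requests yielded from parse are dropped as offsite, so parse_article never runs and no items are produced. Add fivethirtyeight.com to the list.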

From the Scrapy documentation (https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains):

allowed_domains

An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won’t be followed if OffsiteMiddleware is enabled.

Let’s say your target url is https://www.example.com/1.html, then add 'example.com' to the list.
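
To make the matching rule concrete, here is a minimal sketch of the domain check described in the quote above. It is only an approximation written with the standard library, not Scrapy's actual implementation; the function name is_url_allowed and the article URL are made up for illustration.

```python
from urllib.parse import urlparse


def is_url_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host is an allowed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        # Exact match, or a subdomain such as www.example.com for example.com
        if host == domain or host.endswith("." + domain):
            return True
    return False


# Hypothetical article URL on the site from the question
article_url = "http://fivethirtyeight.com/features/some-article/"

print(is_url_allowed(article_url, ["web"]))                  # the original setting
print(is_url_allowed(article_url, ["fivethirtyeight.com"]))  # the corrected setting
print(is_url_allowed("https://www.example.com/1.html", ["example.com"]))
```

Under this rule the article link does not match 'web', which is consistent with the article requests being dropped, while it does match 'fivethirtyeight.com'.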

Upvotes: 2
