Reputation: 27
I am scraping a few pages without errors, but the crawler is not generating any output.
The function parse_article works fine (I tested it separately), but together with the parse function, it no longer creates any output. Any ideas?
I was running the crawler via command line: scrapy crawl all_articles_from_one_page -o test_file.csv
import datetime
import socket

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose

from scrapping_538.items import Scrapping538Item


class BasicSpider(scrapy.Spider):
    name = 'all_articles_from_one_page'
    allowed_domains = ['web']
    start_urls = ('http://fivethirtyeight.com/features/',)

    def parse(self, response):
        # iterate through articles
        article_divs = response.xpath('//*[@id="primary"]//div[contains(@id, "post")]')
        for article in article_divs:
            print('\n**********************************************')
            article_link = article.xpath('.//h2/a/@href').extract()[0]
            print('------article link: ' + str(article_link))
            yield scrapy.Request(article_link, callback=self.parse_article)

    def parse_article(self, response):
        il = ItemLoader(item=Scrapping538Item(), response=response)
        il.add_css('title', 'h1.article-title::text')
        il.add_css('date', 'time.datetime::text')
        il.add_css('author', '.author::text')
        il.add_css('filed_under', '.term::text')
        il.add_css('article_text', '.entry-content *::text')
        il.add_value('url', response.url)
        il.add_value('project', self.settings.get('BOT_NAME'))
        il.add_value('spider', self.name)
        il.add_value('server', socket.gethostname())
        il.add_value('date_import', datetime.datetime.now())
        return il.load_item()
Upvotes: 2
Views: 957
Reputation: 3740
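Change your allowed_domains to:
allowed_domains = ['fivethirtyeight.com']
Scrapy filters out any request to a domain not listed in that attribute. Since 'web' does not match fivethirtyeight.com, the requests yielded from parse are dropped before parse_article is ever called, so no items are produced. Add fivethirtyeight.com to the list.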
(https://docs.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.allowed_domains)
allowed_domains
An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list (or their subdomains) won't be followed if OffsiteMiddleware is enabled.
Let's say your target url is https://www.example.com/1.html, then add 'example.com' to the list.
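To see why 'web' blocks every article request, here is a rough standard-library sketch of the kind of host check the quoted documentation describes. It is not Scrapy's actual implementation; the helper name is_allowed and the article URL are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Approximate the offsite check: the URL's host must equal an
    allowed domain or be a subdomain of one (hypothetical helper)."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


# Hypothetical article URL of the kind yielded from parse()
article_url = "http://fivethirtyeight.com/features/some-article/"

print(is_allowed(article_url, ["web"]))                  # False: request would be filtered
print(is_allowed(article_url, ["fivethirtyeight.com"]))  # True: request would be followed
```

Under this check a subdomain such as www.fivethirtyeight.com also passes when 'fivethirtyeight.com' is listed, which matches the "(or their subdomains)" wording in the documentation.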
Upvotes: 2