Reputation: 7349
I am following a Scrapy tutorial. I believe my code matches the tutorial's, yet my scraper only scrapes the first page, logs the following message about my first Request to another page, and then finishes. Have I perhaps put my second yield statement in the wrong place?
DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET https://newyork.craigslist.org/search/egr?s=120>
2017-05-20 18:21:31 [scrapy.core.engine] INFO: Closing spider (finished)
Here is my code:
    import scrapy
    from scrapy import Request


    class JobsSpider(scrapy.Spider):
        name = "jobs"
        allowed_domains = ["https://newyork.craigslist.org/search/egr"]
        start_urls = ['https://newyork.craigslist.org/search/egr/']

        def parse(self, response):
            jobs = response.xpath('//p[@class="result-info"]')
            for job in jobs:
                title = job.xpath('a/text()').extract_first()
                address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
                relative_url = job.xpath('a/@href').extract_first("")
                absolute_url = response.urljoin(relative_url)
                yield {'URL': absolute_url, 'Title': title, 'Address': address}

            # scrape all pages
            next_page_relative_url = response.xpath('//a[@class="button next"]/@href').extract_first()
            next_page_absolute_url = response.urljoin(next_page_relative_url)
            yield Request(next_page_absolute_url, callback=self.parse)
Upvotes: 1
Views: 158
Reputation: 7349
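OK, I figured it out. I had to change this line:
    allowed_domains = ["https://newyork.craigslist.org/search/egr"]
to this:
    allowed_domains = ["newyork.craigslist.org"]
and now it works. allowed_domains takes bare domain names, not full URLs with a scheme and path, so with the original value every follow-up request was filtered as offsite.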
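For anyone wondering why the full URL fails, here is a rough sketch of the idea as I understand it: the offsite filter compares only the hostname of each request against the entries in allowed_domains. This is a simplified, standard-library-only approximation, not Scrapy's actual code, and host_is_allowed is a name I made up for illustration.

```python
import re
from urllib.parse import urlparse


def host_is_allowed(url, allowed_domains):
    """Hypothetical stand-in for an offsite check: compare only the URL's
    hostname against the allowed domains (exact match or subdomain)."""
    host = urlparse(url).hostname or ""
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


next_page = "https://newyork.craigslist.org/search/egr?s=120"

# A full URL can never equal a bare hostname, so the request looks offsite.
broken = host_is_allowed(next_page, ["https://newyork.craigslist.org/search/egr"])

# A bare domain matches the hostname, so the request would be followed.
fixed = host_is_allowed(next_page, ["newyork.craigslist.org"])

print(broken, fixed)
```

With the URL-style entry the hostname never matches, which is consistent with the "Filtered offsite request" message in the question.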
Upvotes: 1