Totem

Reputation: 7349

Scrapy scraper not scraping past 1st page

I am following a Scrapy tutorial here. As far as I can tell, my code matches the tutorial's, yet my scraper only scrapes the first page, logs the following message about my first Request to another page, and then finishes. Have I perhaps got my second yield statement in the wrong place?

DEBUG: Filtered offsite request to 'newyork.craigslist.org': <GET https://newyork.craigslist.org/search/egr?s=120>

2017-05-20 18:21:31 [scrapy.core.engine] INFO: Closing spider (finished)

Here is my code:

import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["https://newyork.craigslist.org/search/egr"]
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first("")
            absolute_url = response.urljoin(relative_url)

            yield {'URL': absolute_url, 'Title': title, 'Address': address}

        # scrape all pages
        next_page_relative_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        next_page_absolute_url = response.urljoin(next_page_relative_url)

        yield Request(next_page_absolute_url, callback=self.parse)

Upvotes: 1

Views: 158

Answers (1)

Totem

Reputation: 7349

Ok, so I figured it out. I had to change this line:

allowed_domains = ["https://newyork.craigslist.org/search/egr"]

to this:

allowed_domains = ["newyork.craigslist.org"]

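and now it works. allowed_domains expects bare domain names, not full URLs. With the scheme and path included, the entry never matched the hostname of the next-page request, so every follow-up request was filtered as offsite.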
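
To illustrate why the original value failed, here is a minimal sketch using only the standard library. It is not Scrapy's actual implementation: is_allowed is a hypothetical helper, and I am assuming the offsite filter roughly compares the request's hostname against each allowed_domains entry, accepting exact matches and subdomains.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical approximation of an offsite check: compare the URL's
    hostname against each allowed entry (exact match or subdomain)."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


next_page = "https://newyork.craigslist.org/search/egr?s=120"

# A full URL used as the entry never equals a hostname, so the request is filtered.
print(is_allowed(next_page, ["https://newyork.craigslist.org/search/egr"]))  # False

# A bare domain matches the hostname, so the request is allowed.
print(is_allowed(next_page, ["newyork.craigslist.org"]))  # True
```

The hostname of the next-page URL is just newyork.craigslist.org, which is why only the bare-domain entry matches.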

Upvotes: 1
