SIM

Reputation: 22440

Scrapy ignoring the content of the second page

I've written a tiny scraper in Python Scrapy to parse names from a webpage. The pagination spans four more pages. There are 46 names in total across the pages, but the scraper is collecting only 36.

The scraper would normally skip the content of the first landing page, but I've handled that using parse_start_url in my scraper.

However, the problem I'm facing now is that the scraper surprisingly skips the content of the second page and parses all the rest: the first page, the third page, the fourth page, and so on. Why is this happening, and how do I deal with it? Thanks in advance.

Here is the script I'm trying with:

import scrapy

class DataokSpider(scrapy.Spider):

    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?page=1&f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield scrapy.Request(url=response.urljoin(new_link), callback=self.target_page)

    def target_page(self, response):
        parse_start_url = self.target_page  # I used this to capture the content of the first page
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

Upvotes: 3

Views: 390

Answers (2)

SIM

Reputation: 22440

The solution turns out to be very easy. I've fixed it already.

import scrapy

class DataokSpider(scrapy.Spider):

    name = "dataoksp"
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"]

    def parse(self, response):
        for f_link in self.start_urls:
            yield response.follow(url=f_link, callback=self.target_page)  # this is the line that fixes the issue

        for link in response.css('.pagination .pager-item a'):
            new_link = link.css("::attr(href)").extract_first()
            yield response.follow(url=new_link, callback=self.target_page)

    def target_page(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

Now it gives me all the results.
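For what it's worth, response.follow resolves relative hrefs against the current response URL using standard urljoin semantics, which is why the pager links work without being absolute. A quick stdlib illustration (the ?page=1 href here is only illustrative of what the pager emits):

```python
from urllib.parse import urljoin

# The landing-page URL from start_urls above.
base = "https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"

# A query-only pager href resolves against the page's path,
# replacing the query string entirely:
print(urljoin(base, "?page=1"))
# → https://data.ok.gov/browse?page=1
```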

Upvotes: 1

This happens because the link you're specifying in start_urls is actually the link to the second page. If you open it, you'll see there is no <a> tag for the current page in the pagination block. That's why page 2 never reaches target_page, and it's why you should point start_urls to:

https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191

This code should help you:

import scrapy
from scrapy.http import Request


class DataokspiderSpider(scrapy.Spider):
    name = 'dataoksp'
    allowed_domains = ['data.ok.gov']
    start_urls = ["https://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191",]

    def parse(self, response):
        for titles in response.css('.title a'):
            name = titles.css("::text").extract_first()
            yield {'Name':name}

        next_page = response.xpath('//li[@class="pager-next"]/a/@href').extract_first()
        if next_page:
            yield Request("https://data.ok.gov{}".format(next_page), callback=self.parse)
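This version also terminates cleanly: on the last page the XPath finds no pager-next link, extract_first() returns None, and no further Request is yielded. The same follow-the-next-link pattern can be sketched with a hypothetical in-memory stand-in for the pages:

```python
# Hypothetical stand-in: each page URL maps to its "next" href (None on the last page).
pages = {
    "/browse": "/browse?page=1",
    "/browse?page=1": "/browse?page=2",
    "/browse?page=2": None,
}

def crawl(start):
    url, visited = start, []
    while url is not None:   # mirrors `if next_page:` in the spider
        visited.append(url)
        url = pages[url]     # mirrors extracting the pager-next href
    return visited

print(crawl("/browse"))
# → ['/browse', '/browse?page=1', '/browse?page=2']
```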

Stats (see item_scraped_count):

{
    'downloader/request_bytes': 2094,
    'downloader/request_count': 6,
    'downloader/request_method_count/GET': 6,
    'downloader/response_bytes': 45666,
    'downloader/response_count': 6,
    'downloader/response_status_count/200': 6,
    'finish_reason': 'finished',
    'finish_time': datetime.datetime(2017, 9, 19, 7, 23, 47, 801934),
    'item_scraped_count': 46,
    'log_count/DEBUG': 53,
    'log_count/INFO': 7,
    'memusage/max': 47509504,
    'memusage/startup': 47509504,
    'request_depth_max': 4,
    'response_received_count': 6,
    'scheduler/dequeued': 5,
    'scheduler/dequeued/memory': 5,
    'scheduler/enqueued': 5,
    'scheduler/enqueued/memory': 5,
    'start_time': datetime.datetime(2017, 9, 19, 7, 23, 46, 59360)
}

Upvotes: 0
