Sourabh

Reputation: 755

scrapy - parsing multiple times

I am trying to parse a domain whose contents are as follows:
Page 1 - contains links to 10 articles
Page 2 - contains links to 10 articles
Page 3 - contains links to 10 articles
and so on...

My job is to parse all the articles on all pages.
My thought: parse all the pages, store the links to all the articles in a list, and then iterate over the list and parse each link.

So far I have been able to iterate through the pages, parse and collect links to the articles. I am stuck on how to start parsing this list.

My code so far:

import scrapy

class DhoniSpider(scrapy.Spider):
    name = "test"
    start_urls = [
            "https://www.news18.com/cricketnext/newstopics/ms-dhoni.html"
    ]
    count = 0
    def __init__(self, *a, **kw):
        super(DhoniSpider, self).__init__(*a, **kw)
        self.headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        self.seed_urls = []

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse(self, response):
        DhoniSpider.count += 1
        if DhoniSpider.count > 2:
            # there are many pages, this is just to stop parsing after 2 pages
            return
        for ul in response.css('div.t_newswrap'):
            ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
            self.seed_urls.extend(ref_links)

        next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

    def iterate_urls(self):
        for link in self.seed_urls:
            link = response.urljoin(link)
            yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)

    def parse_page(self, response):
        print("called")

How do I iterate over my self.seed_urls list and parse those links? From where should I call my iterate_urls function?

Upvotes: 2

Views: 583

Answers (2)

Georgiy

Reputation: 3561

Usually in cases like this there is no need for a separate function like your iterate_urls:

def parse(self, response):
    DhoniSpider.count += 1
    if DhoniSpider.count > 2:
        # there are many pages, this is just to stop parsing after 2 pages
        return
    for ul in response.css('div.t_newswrap'):
        for ref_link in ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall():
            yield scrapy.Request(response.urljoin(ref_link), headers=self.headers, callback=self.parse_page, priority=5)

    next_page = response.css('ul.pagination li a.nxt::attr(href)').get()
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, headers=self.headers, callback=self.parse)

def parse_page(self, response):
    print("called")

Upvotes: 1

Krisz

Reputation: 2224

You don't have to collect the links into a list; you can yield a scrapy.Request right after you parse them. So instead of self.seed_urls.extend(ref_links), you can modify the following function:

    def iterate_urls(self, response, seed_urls):
        for link in seed_urls:
            link = response.urljoin(link)
            yield scrapy.Request(link, headers=self.headers, callback=self.parse_page)

and call it:

...
        for ul in response.css('div.t_newswrap'):
            ref_links = ul.css('div.t_videos_box a.t_videosimg::attr(href)').getall()
            yield from self.iterate_urls(response, ref_links)
...
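
Since iterate_urls is a generator, its requests have to be re-yielded from parse (hence the yield from), and it needs the response object so that response.urljoin can resolve the relative links.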

Upvotes: 0
