Samsul Islam

Reputation: 2619

Scrapy pagination is not working, and how to optimize the spider

Please help me optimize my Scrapy spider. In particular, the next-page pagination is not working. There are many pages, and each page has 50 items. I catch the first page's 50 item links in parse_items, and the next pages' items should also be scraped in parse_items.

import scrapy
from scrapy import Field
from fake_useragent import UserAgent 

class DiscoItem(scrapy.Item):
    release = Field()
    images = Field()


class discoSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['discogs.com']
    query = input('ENTER SEARCH MUSIC TYPE : ')
    start_urls =['http://www.discogs.com/search?q=%s&type=release'%query]
    custome_settings = {
        'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36",
        'handle_httpstatus_list': [301, 302],
        'download_delay': 10,
    }

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)   

    def parse(self, response):
        print('START parse \n')
        print("*****",response.url)

        #next page pagination 
        next_page = response.css('a.pagination_next::attr(href)').extract_first()
        next_page = response.urljoin(next_page)
        yield scrapy.Request(url=next_page, callback=self.parse_items2)

        headers = {}
        for link in response.css('a.search_result_title ::attr(href)').extract():
            ua = UserAgent()  # random user agent
            headers['User-Agent'] = ua.random
            yield scrapy.Request(response.urljoin(link), headers=headers, callback=self.parse_items)


    def parse_items2(self, response):
        print('parse_items2 *******', response.url)
        yield scrapy.Request(url=response.url, callback=self.parse)  

    def parse_items(self,response):

        print("parse_items**********",response.url)
        items = DiscoItem()
        for imge in response.css('div#page_content'):
            img = imge.css("span.thumbnail_center img::attr(src)").extract()[0]
            items['images'] = img
            release=imge.css('div.content a ::text').extract()
            items['release']=release[4]
            yield items

Upvotes: 1

Views: 187

Answers (2)

Janib Soomro

Reputation: 632

Try this for pagination:

try:
    nextpage = response.urljoin( response.xpath("//*[contains(@rel,'next') and contains(@id,'next')]/@url")[0].extract() )
    yield scrapy.Request( nextpage, callback=self.parse )
except:
    pass
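A note on this approach: the bare `except: pass` silently swallows every error, not just the "no next page" case. A related pitfall with the original code is that `Response.urljoin()` delegates to `urllib.parse.urljoin()`, which returns the base URL unchanged when the joined value is `None` or empty, so joining a missing "next" href quietly re-requests the current page. A minimal stdlib sketch (no Scrapy needed) of why an explicit `None` check is the safer guard:

```python
from urllib.parse import urljoin

# urljoin() returns the base URL unchanged when the second argument is
# None or empty, so a missing "next" href silently points back at the
# page you are already on (which Scrapy's dupefilter then drops).
base = 'https://www.discogs.com/search/?q=rock&type=release&page=2'

print(urljoin(base, '/search/?q=rock&type=release&page=3'))
# https://www.discogs.com/search/?q=rock&type=release&page=3

print(urljoin(base, None))  # falls back to the base URL itself
# https://www.discogs.com/search/?q=rock&type=release&page=2
```

So rather than catching everything, checking `if next_page is not None:` before yielding the request makes the end-of-pagination case explicit.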

Upvotes: 0

stranac

Reputation: 28256

When I try running your code (after fixing the many indentation, spelling and letter case errors), this line is shown in scrapy's log:

2018-03-05 00:47:28 [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.discogs.com/search/?q=rock&type=release&page=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

Scrapy will filter duplicate requests by default, and your parse_items2() method does nothing but create duplicate requests. I fail to see any reason for that method existing.
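For illustration, Scrapy's default dupefilter behaves roughly like a set of already-seen request fingerprints. Here is a simplified pure-Python sketch of that idea (the real `RFPDupeFilter` hashes the request method, canonicalized URL and body, not just the URL string):

```python
# Toy model of Scrapy's duplicate-request filtering: a set of seen URLs.
# This is a simplification for illustration, not Scrapy's actual API.
class ToyDupeFilter:
    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        if url in self.seen:
            return True  # request is dropped, its callback never runs
        self.seen.add(url)
        return False

f = ToyDupeFilter()
page2 = 'https://www.discogs.com/search/?q=rock&type=release&page=2'
print(f.request_seen(page2))  # False: the first request goes through
print(f.request_seen(page2))  # True: re-yielding the same URL is filtered
```

Because `parse_items2()` yields a request for `response.url`, the exact URL that was just fetched, its request is always a duplicate and gets filtered, which is why the pagination never advances.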

What you should do instead is specify the `parse()` method as the callback for your requests, and avoid having an extra method that does nothing:

yield scrapy.Request(url=next_page, callback=self.parse)

Upvotes: 2
