Reputation: 782
I am writing a Scrapy script to search and scrape results from a website. I need to search for items on the website and parse each URL from the search results. I started with Scrapy's start_requests,
where I pass the search query and set the callback to another function, parse,
which retrieves the URLs from the search results. Finally, I call another function, parse_item, to parse the results. I'm able to extract all the search result URLs, but I'm not able to parse the results (parse_item
is not being called). Here is the code:
# -*- coding: utf-8 -*-
from scrapy.http.request import Request
from scrapy.spider import BaseSpider

class xyzspider(BaseSpider):
    name = 'dspider'
    allowed_domains = ["www.example.com"]
    mylist = ['Search item 1', 'Search item 2']
    url = 'https://example.com/search?q='

    def start_requests(self):
        for i in self.mylist:
            i = i.replace(' ', '+')
            starturl = self.url + i
            yield Request(starturl, self.parse)

    def parse(self, response):
        itemurl = response.xpath(".//section[contains(@class, 'search-results')]/a/@href").extract()
        for j in itemurl:
            print j
            yield Request(j, self.parse_item)

    def parse_item(self, response):
        print "hello"
        '''rating = response.xpath(".//ul[@class = 'ratings']/li[1]/span[1]/text()").extract()
        print rating'''
Could anyone please help me? Thank you.
Upvotes: 4
Views: 3658
Reputation: 9246
Your code looks good, so your requests are probably being dropped by the duplicates filter. You might need to set the Request attribute dont_filter to True:
yield Request(j,self.parse_item, dont_filter=True)
From the docs:
dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. This is used when you want to perform an identical request multiple times, to ignore the duplicates filter. Use it with care, or you will get into crawling loops. Default to False.
Anyway, I recommend you have a look at Item Pipelines. A spider hands its scraped items to them simply by yielding them:
yield my_object
Item pipelines then post-process everything the spider yields.
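For illustration, here is a minimal sketch of what that could look like; the MyItem and RatingPipeline names, the item fields, and the cleanup step are all hypothetical, not from your code:
    # items.py -- a hypothetical item holding the fields you seem to want
    from scrapy.item import Item, Field

    class MyItem(Item):
        url = Field()
        rating = Field()

    # pipelines.py -- a minimal pipeline; the whitespace cleanup is just an example
    class RatingPipeline(object):
        def process_item(self, item, spider):
            # called for every item the spider yields; must return the item
            # (or raise DropItem) so it continues down the pipeline
            rating = item.get('rating')
            if rating:
                # assumes rating was stored as a single string
                item['rating'] = rating.strip()
            return item

    # settings.py -- enable the pipeline (dict syntax in Scrapy >= 0.20)
    ITEM_PIPELINES = {'myproject.pipelines.RatingPipeline': 300}
Your parse_item could then yield MyItem(url=response.url, rating=...) instead of printing, and the pipeline would receive it.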
Upvotes: 1
Reputation: 782
I was getting a Filtered offsite request error. I changed the allowed domain from allowed_domains = ["www.xyz.com"] to allowed_domains = ["xyz.com"] and it worked perfectly.
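For anyone hitting the same thing: the offsite middleware only allows requests whose host is, or is a subdomain of, an entry in allowed_domains, so with "www." in the entry the bare-domain result URLs were filtered. Against the example spider above (example.com being the placeholder domain), the fix looks like this:
    class xyzspider(BaseSpider):
        name = 'dspider'
        # bare domain: allows example.com and any subdomain, including www.
        allowed_domains = ["example.com"]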
Upvotes: 1