wolfgang
wolfgang

Reputation: 7789

Simple Scrapy crawler not following links & scraping

Basically the problem is in following the links

I'm going from page 1..2..3..4..5.....90 pages in total

each page has 100 or so links

Each page is in this format

http://www.consumercomplaints.in/lastcompanieslist/page/1
http://www.consumercomplaints.in/lastcompanieslist/page/2
http://www.consumercomplaints.in/lastcompanieslist/page/3
http://www.consumercomplaints.in/lastcompanieslist/page/4

This is regex matching rule

Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")

I'm going to each page and then creating a Request object to scrape all the the links in each page

Scrapy only crawls 179 links in total each time and then gives a finished status

What am i doing wrong?

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urlparse

class consumercomplaints_spider(CrawlSpider):
    name = "test_complaints"
    allowed_domains = ["www.consumercomplaints.in"]
    protocol='http://'

    start_urls = [
        "http://www.consumercomplaints.in/lastcompanieslist/"
    ]

    #These are the rules for matching the domain links using a regularexpression, only matched links are crawled
    rules = [
        Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
    ]


    def parse_data(self, response):
        #Get All the links in the page using xpath selector
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()

        #Convert each Relative page link to Absolute page link -> /abc.html -> www.domain.com/abc.html and then send Request object
        for relative_link in all_page_links:
            print "relative link procesed:"+relative_link

            absolute_link = urlparse.urljoin(self.protocol+self.allowed_domains[0],relative_link.strip())
            request = scrapy.Request(absolute_link,
                         callback=self.parse_complaint_page)
            return request


        return {}

    def parse_complaint_page(self,response):
        print "SCRAPED"+response.url
        return {}

Upvotes: 1

Views: 670

Answers (1)

Nabin
Nabin

Reputation: 11776

You will need to use yield instead of return.

for each new Request object, use yield request instead of return reqeust

See more about yield here and the difference between them and reason here

Upvotes: 1

Related Questions