Reputation: 7789
Basically the problem is in following the links
I'm going from page 1..2..3..4..5.....90 pages in total
each page has 100 or so links
Each page is in this format
http://www.consumercomplaints.in/lastcompanieslist/page/1
http://www.consumercomplaints.in/lastcompanieslist/page/2
http://www.consumercomplaints.in/lastcompanieslist/page/3
http://www.consumercomplaints.in/lastcompanieslist/page/4
This is regex matching rule
Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
I'm going to each page and then creating a Request
object to scrape all the the links in each page
Scrapy only crawls 179 links in total each time and then gives a finished
status
What am i doing wrong?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urlparse
class consumercomplaints_spider(CrawlSpider):
name = "test_complaints"
allowed_domains = ["www.consumercomplaints.in"]
protocol='http://'
start_urls = [
"http://www.consumercomplaints.in/lastcompanieslist/"
]
#These are the rules for matching the domain links using a regularexpression, only matched links are crawled
rules = [
Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
]
def parse_data(self, response):
#Get All the links in the page using xpath selector
all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()
#Convert each Relative page link to Absolute page link -> /abc.html -> www.domain.com/abc.html and then send Request object
for relative_link in all_page_links:
print "relative link procesed:"+relative_link
absolute_link = urlparse.urljoin(self.protocol+self.allowed_domains[0],relative_link.strip())
request = scrapy.Request(absolute_link,
callback=self.parse_complaint_page)
return request
return {}
def parse_complaint_page(self,response):
print "SCRAPED"+response.url
return {}
Upvotes: 1
Views: 670