Reputation: 43
Hi, I have the Scrapy code below (I have removed many if blocks and simplified it so it is easy to follow). The problem is that the spider scrapes only the first page of the website. I figured out that scrapy.Request does not seem to fetch the new URL: item['url'] is always set to the base URL, so only that page is downloaded.
import scrapy
from collections import Counter
from scrapy.selector import Selector
from Mycode.items import *

class ExampleSpider(scrapy.Spider):
    name = "full_sites"

    def __init__(self, site=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.start_urls = [site]
        self.base_url = site
        self._site = site
        self.allowed_domains = [self._site]

    def parse(self, response):
        for i in response.xpath('//a/@href').extract():
            print '================'
            print 'i entered=', i
            url = self.base_url + i
            print url, 'go to scrapy'
            yield scrapy.Request(url=url, callback=self.parse)
            item = FullSitesItem()
            item['url'] = response.url
            print 'item=', item['url']
            yield item
I get these outputs on my monitor:
================
i entered= /service
http://webscraper.io/service go to scrapy
item= http://webscraper.io
================
i entered= /sitemap-specialist
http://webscraper.io/sitemap-specialist go to scrapy
item= http://webscraper.io
================
i entered= /screenshots
http://webscraper.io/screenshots go to scrapy
item= http://webscraper.io
================
So regardless of the URL passed to scrapy.Request, item['url'] is always the same! How can I fix this problem?
Thanks
Upvotes: 1
Views: 516
Reputation: 1239
You yield the item inside the loop over i, but response.url is the URL of the page currently being parsed, so it returns the same value on every iteration.
You can keep your parse method like this:
def parse(self, response):
    for i in response.xpath('//a/@href').extract():
        print '================'
        print 'i entered=', i
        url = self.base_url + i
        print url, 'go to scrapy'
        yield scrapy.Request(url=url, callback=self.parse)
This will handle the main page.
Create a different parse method to handle the other pages, and switch callback=self.parse to the new method.
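To illustrate the two-callback idea without a real crawl, here is a minimal sketch in plain Python 3. It is a toy stand-in, not Scrapy itself: Request, Response, FAKE_SITE, crawl and parse_page are all hypothetical names invented for this example, and the page data is made up. It only shows that each callback sees the URL of the page it was called for.

```python
from collections import deque, namedtuple
from urllib.parse import urljoin

# Hypothetical stand-ins for scrapy.Request and the response object.
Request = namedtuple("Request", ["url", "callback"])
Response = namedtuple("Response", ["url", "links"])

# Made-up site: page URL -> hrefs found on that page.
FAKE_SITE = {
    "http://example.com/": ["/service", "/screenshots"],
    "http://example.com/service": [],
    "http://example.com/screenshots": [],
}


def parse(response):
    """Main page: only schedule follow-up requests."""
    for href in response.links:
        # urljoin also copes with absolute hrefs and hrefs without a leading slash.
        yield Request(urljoin(response.url, href), callback=parse_page)


def parse_page(response):
    """Followed page: response.url is now that page's own URL."""
    yield {"url": response.url}


def crawl(start_url):
    """Toy engine: build a response for each queued request, then run its callback."""
    queue = deque([Request(start_url, parse)])
    seen = {start_url}
    items = []
    while queue:
        request = queue.popleft()
        response = Response(request.url, FAKE_SITE.get(request.url, []))
        for result in request.callback(response):
            if isinstance(result, Request):
                if result.url not in seen:  # skip pages already queued
                    seen.add(result.url)
                    queue.append(result)
            else:
                items.append(result)
    return items


items = crawl("http://example.com/")
print([item["url"] for item in items])
# ['http://example.com/service', 'http://example.com/screenshots']
```

In a real spider the same shape would be a second method on the spider class passed as callback. Separately, and as something to check rather than a confirmed diagnosis: Scrapy's allowed_domains expects bare domain names such as webscraper.io, not full URLs, so if site includes the http:// scheme the follow-up requests may be filtered as offsite.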
Upvotes: 1