nakisa

Reputation: 43

Scrapy.request does not get the new url

Hi, I have the Scrapy code below (I have deleted many if blocks and simplified it so it is easy to understand). The problem is that this spider scrapes only the first page of the website. I figured out that scrapy.Request does not get the new URL: item['url'] is always set to the base URL, so only that page is downloaded.

import scrapy 
from collections import Counter
from scrapy.selector import Selector
from Mycode.items import *

class ExampleSpider(scrapy.Spider):
    name = "full_sites"

    def __init__(self, site=None, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)
        self.start_urls = [site]
        self.base_url = site
        self._site = site
        self.allowed_domains = [self._site]

    def parse(self, response):
        for i in response.xpath('//a/@href').extract():
            print '================'
            print 'i entered=', i
            url = self.base_url + i
            print url, 'go to scrapy'
            yield scrapy.Request(url= url, callback=self.parse) 

            item = FullSitesItem()
            item['url'] = response.url
            print 'item=', item['url']
            yield item 

I get these outputs on my monitor:

================
i entered= /service
http://webscraper.io/service go to scrapy
item= http://webscraper.io
================
i entered= /sitemap-specialist
http://webscraper.io/sitemap-specialist go to scrapy
item= http://webscraper.io
================
i entered= /screenshots
http://webscraper.io/screenshots go to scrapy
item= http://webscraper.io
================ 

So regardless of the URL passed to scrapy.Request, item['url'] is always the same! How can I fix this problem?

thanks

Upvotes: 1

Views: 516

Answers (1)

Eran H.

Reputation: 1239

You are yielding the item inside the loop over i, but response.url always refers to the page currently being parsed (the response that was already downloaded), not the request you just yielded, so it will always return the same result.

You can keep your parse method like this:

def parse(self, response):
    for i in response.xpath('//a/@href').extract():
        print '================'
        print 'i entered=', i
        url = self.base_url + i
        print url, 'go to scrapy'
        yield scrapy.Request(url= url, callback=self.parse) 

This will handle the main page.

Create a different parse method to handle the other pages, and switch callback=self.parse to the new method.

Upvotes: 1
