insecte

Reputation: 11

Why does my Scrapy loop return only one result?

I'm trying to use Scrapy to crawl a page that contains a lot of links, but my code so far only shows the contents of the first link.

What mistake have I made?

from scrapy.spiders import BaseSpider
from scrapy.spiders import Spider
from scrapy.http.request import Request
from scrapy.selector import Selector
from Proje.items import ProjeItem

class ProjeSpider(BaseSpider):
    name = "someweb"
    allowed_domains = ["someweb.com"]
    start_urls = [
        "http://someweb.com/indeks/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@id="indeks-container"]'):
            for tete in sel.xpath('//linkkk').re('//linkkk.*?(?=")'):
                links = 'http:' + str(tete)
                req = Request(links, callback=self.kontene)
                return req

    def kontene(self, response):
        for mbuh in response.xpath('//head'):
            Item = ProjeItem()
            Item['title'] = mbuh.xpath('//title/text()').extract()
            yield Item

Upvotes: 0

Views: 431

Answers (2)

miraculixx

Reputation: 10349

According to the Scrapy docs, parse needs to return an iterable of Request objects, i.e. a list or a generator. Just change return to yield, which turns parse into a generator, and it should work as expected:

def parse(self, response):
    for sel in response.xpath('//ul[@id="indeks-container"]'):
        for tete in sel.xpath('//linkkk').re('//linkkk.*?(?=")'):
            links = 'http:' + str(tete)
            req = Request(links, callback=self.kontene)
            yield req
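
If it helps to see the difference outside of Scrapy, here is a minimal sketch in plain Python. The function names first_only and all_links and the sample URLs are made up for illustration; they only stand in for parse and the links matched by the regex.

```python
def first_only(links):
    """Mimics the original parse(): return exits on the first iteration."""
    for link in links:
        return "http:" + link


def all_links(links):
    """Mimics the fixed parse(): yield hands back one value per iteration."""
    for link in links:
        yield "http:" + link


# Hypothetical protocol-relative links, standing in for the regex matches.
sample = ["//someweb.com/a", "//someweb.com/b", "//someweb.com/c"]

print(first_only(sample))       # a single string, built from the first link only
print(list(all_links(sample)))  # one string per link
```

The generator version does not build the whole result up front; the caller pulls values one at a time, which is how Scrapy is able to consume the requests a callback yields.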

Upvotes: 1

Jenner Felton

Reputation: 807

The issue is that you have a return statement inside your for loop. In Python, return exits the function immediately, so you only get the first link's worth of content. Instead, consider appending each req to a list and returning that list after the loop has finished.

def parse(self, response):
    req_list = []
    for sel in response.xpath('//ul[@id="indeks-container"]'):
        for tete in sel.xpath('//linkkk').re('//linkkk.*?(?=")'):
            links = 'http:' + str(tete)
            req = Request(links, callback=self.kontene)
            # append adds the Request itself; += would try to iterate over it
            req_list.append(req)
    return req_list
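
The same list-building pattern can be tried without Scrapy installed. This is only a sketch: FakeRequest is a hypothetical stand-in for scrapy's Request, and collect_requests and the sample links are made up for the example.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy's Request, so this runs without Scrapy."""

    def __init__(self, url):
        self.url = url


def collect_requests(links):
    """Build one request per link and return them all after the loop."""
    req_list = []
    for link in links:
        req = FakeRequest("http:" + link)
        # list.append stores the object itself. Using `req_list += req`
        # would try to iterate over req and raise TypeError, because
        # a plain object like this is not iterable.
        req_list.append(req)
    return req_list  # outside the loop, so every link has been processed


reqs = collect_requests(["//someweb.com/a", "//someweb.com/b"])
print([r.url for r in reqs])
```

Compared with yield, this holds every request in memory before handing them back, which is usually fine for a short index page.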

Upvotes: 1
