Reputation: 11
I'm trying to use Scrapy to crawl a page with a lot of links inside, but my existing code only shows the contents of the first link.
What mistake have I made?
from scrapy.spiders import BaseSpider
from scrapy.spiders import Spider
from scrapy.http.request import Request
from scrapy.selector import Selector
from Proje.items import ProjeItem

class ProjeSpider(BaseSpider):
    name = "someweb"
    allowed_domains = ["someweb.com"]
    start_urls = [
        "http://someweb.com/indeks/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul[@id="indeks-container"]'):
            for tete in sel.xpath('//linkkk').re('//linkkk.*?(?=")'):
                links = 'http:' + str(tete)
                req = Request(links, callback=self.kontene)
                return req

    def kontene(self, response):
        for mbuh in response.xpath('//head'):
            Item = ProjeItem()
            Item['title'] = mbuh.xpath('//title/text()').extract()
            yield Item
Upvotes: 0
Views: 431
Reputation: 10349
According to the Scrapy docs, parse needs to return an iterable of Request objects, i.e. a list or a generator. Just change return to yield and it should work as expected:
def parse(self, response):
    for sel in response.xpath('//ul[@id="indeks-container"]'):
        for tete in sel.xpath('//linkkk').re('//linkkk.*?(?=")'):
            links = 'http:' + str(tete)
            req = Request(links, callback=self.kontene)
            yield req
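To see why that one-word change matters, here is a minimal sketch in plain Python (no Scrapy involved): a return inside a loop exits the whole function on the first pass, while yield turns the function into a generator that produces every item.

def first_only(items):
    for item in items:
        return item  # exits the function on the first iteration

def all_of_them(items):
    for item in items:
        yield item  # produces one item, then resumes the loop

print(first_only([1, 2, 3]))         # 1
print(list(all_of_them([1, 2, 3])))  # [1, 2, 3]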
Upvotes: 1
Reputation: 807
The issue is that you have a return statement within your for loop. In Python, a return exits the function immediately, giving you only the first link's worth of content. Instead, consider appending each req to a list and returning that list.
def parse(self, response):
    req_list = []
    for sel in response.xpath('//ul[@id="indeks-container"]'):
        for tete in sel.xpath('//linkkk').re('//linkkk.*?(?=")'):
            links = 'http:' + str(tete)
            req = Request(links, callback=self.kontene)
            req_list.append(req)  # append the single Request; += expects an iterable
    return req_list
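Note the append rather than +=: a quick sketch of the difference, since list += expects an iterable on the right-hand side and a single Request object is not one.

reqs = []
reqs += ["a", "b"]   # fine: the right-hand side is an iterable
reqs.append("c")     # appends one object as-is
# reqs += object()   # would raise TypeError: the operand is not iterable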
Upvotes: 1