Reputation: 53
I am struggling with crawling nested pages.
The number of items I get equals the number of entries on the first crawled page.
The site is structured like this:
Let's say Brand A has 2 models, with 11 announcements in the first model and 9 in the second. Brand B has 3 models, each with 5 announcements.
In the example above I need each announcement as a separate item (35 in total), but instead I get one item per brand: Brand A with its first announcement, then Brand B with its first announcement.
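To make the expected count concrete, here is a small plain-Python sketch of the layout described above. The model names and the `expected_items` helper are made up for illustration; only the counts come from the example.

```python
# Hypothetical layout from the example: brand -> model -> number of announcements.
site = {
    "Brand A": {"Model A1": 11, "Model A2": 9},
    "Brand B": {"Model B1": 5, "Model B2": 5, "Model B3": 5},
}


def expected_items(site):
    """Yield one (brand, model, announcement_number) tuple per announcement."""
    for brand, models in site.items():
        for model, count in models.items():
            for n in range(1, count + 1):
                yield (brand, model, n)


items = list(expected_items(site))
total = len(items)  # 11 + 9 + 5 + 5 + 5
per_brand = {brand: sum(models.values()) for brand, models in site.items()}
```

So the crawl should produce 20 items for Brand A and 15 for Brand B, not one item per brand.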
class SiteSpider(CrawlSpider):
    log.start(logfile="log.txt", loglevel="DEBUG", logstdout=None)
    name = "site"
    #download_delay = 2
    allowed_domains = ['site.com']
    start_urls = ['http://www.site.com/search.php?c=1111']
    items = {}

    def parse(self, response):
        sel = Selector(response)
        #requests =[]
        brands = sel.xpath("//li[@class='class_11']")
        for brand in brands:
            item = SiteItem()
            url = brand.xpath('a/@href')[0].extract()
            item['marka'] = brand.xpath("a/text()")[0].extract()
            item['marka_link'] = brand.xpath('a/@href')[0].extract()
            request = Request("http://www.site.com" + url, callback=self.parse_model, meta={'item': item})
            # requests.append(request)
            #
            yield request

    def parse_model(self, response):
        sel = Selector(response)
        models = sel.xpath("//li[@class='class_12']")
        for model in models:
            item = SiteItem(response.meta["item"])
            url2 = model.xpath('a/@href')[0].extract()
            item['model'] = model.xpath("a/text()")[0].extract()
            item['model_link'] = url2
            return item
Could you please help this newbie with some pseudocode to implement this? I guess I am making a mistake at a fundamental level.
Upvotes: 5
Views: 3477
Reputation: 11396
In your parse_model you have a loop that creates items but does not yield them, so it returns only a single item per response. Try changing it to:
def parse_model(self, response):
    sel = Selector(response)
    models = sel.xpath("//li[@class='class_12']")
    for model in models:
        # Copy the brand-level item passed along in meta, one copy per model.
        item = SiteItem(response.meta["item"])
        url2 = model.xpath('a/@href')[0].extract()
        item['model'] = model.xpath("a/text()")[0].extract()
        item['model_link'] = url2
        # yield instead of return, so the loop continues to the next model.
        yield item
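The difference between the two versions can be seen without Scrapy at all. The following is a minimal plain-Python sketch; the function names and the dict-based items are hypothetical stand-ins for the spider callback and SiteItem.

```python
def first_only(models):
    """Mimics the original callback: return leaves the function on the first pass."""
    for model in models:
        item = {"model": model}
        return item  # the loop never reaches the second model


def all_models(models):
    """Mimics the fixed callback: yield hands over one item per model."""
    for model in models:
        item = {"model": model}
        yield item  # execution resumes here for the next model


models = ["m1", "m2", "m3"]
returned = first_only(models)        # a single dict
yielded = list(all_models(models))   # one dict per model
```

With return, only the item for "m1" comes back; with yield, the caller receives one item for each of the three models. The same should apply one level deeper: a callback that loops over announcements would need to yield each one as well.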
Upvotes: 7