Reputation: 2626
Unless I set CONCURRENT_REQUESTS_PER_DOMAIN = 1, the items from the pages I crawl end up taking attributes from other pages.
I suspect this comes from the fact that I generate Requests "by hand" in parse_chapter, but I'm not sure, and I'd like to understand how Scrapy operates here.
Here is the relevant portion of the code:
rules = (
    Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                           restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a',
                           process_value=process_links),
         callback='parse_chapter'),
)
def parse_chapter(self, response):
    item = TogItem()
    item['chaptertitle'] = response.xpath('.//*[@id="chapter_num"]/text()').extract()
    pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])
    for p in range(1, pages + 1):
        page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
        print("page_absolute_url: {}".format(page_absolute_url))
        yield Request(page_absolute_url, meta={'item': item},
                      callback=self.parse_pages, dont_filter=True)
def parse_pages(self, response):
    item = response.request.meta['item']
    item['pagenumber'] = response.xpath('.//*[@id="chapter_page"]/text()').extract()
    print(item['pagenumber'])
    images = response.xpath('//*[@id="image"]/@src')
    images_absolute_url = []
    for ie in images:
        print("ie.extract(): {}".format(ie.extract()))
        images_absolute_url.append(urlparse.urljoin(response.url, ie.extract().strip()))
    print("images_absolute_url: {}".format(images_absolute_url))
    item['image_urls'] = images_absolute_url
    yield item
Upvotes: 1
Views: 375
Reputation: 23856
This happens because you are sending the same instance of the item (the item = TogItem() you create in parse_chapter) to all the pages. Since every parse_pages callback writes into that one shared object, each page overwrites the attributes set by the previous one, and with concurrent requests the order is unpredictable.
One way to fix this is to create a fresh item inside the for loop:
def parse_chapter(self, response):
    chaptertitle = response.xpath('.//*[@id="chapter_num"]/text()').extract()
    pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])
    for p in range(1, pages + 1):
        item = TogItem(chaptertitle=chaptertitle)
        page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
        yield Request(page_absolute_url, meta={'item': item},
                      callback=self.parse_pages, dont_filter=True)
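You can see the effect without Scrapy at all. Here is a minimal sketch (plain dicts standing in for TogItem, a plain function standing in for the parse_pages callback) showing why one shared mutable item ends up with only the last page's data, while a fresh item per request keeps each page's data separate:

```python
def fill_page(item, pagenumber):
    # Stand-in for parse_pages: write this page's data into the item
    # that was passed along via meta.
    item['pagenumber'] = pagenumber
    return item

# Sharing one item across every page: the last write wins, because all
# three results are references to the same dict.
shared = {'chaptertitle': 'ch1'}
results_shared = [fill_page(shared, p) for p in (1, 2, 3)]
print([r['pagenumber'] for r in results_shared])    # [3, 3, 3]

# One item per page: each result keeps its own data.
results_separate = [fill_page({'chaptertitle': 'ch1'}, p) for p in (1, 2, 3)]
print([r['pagenumber'] for r in results_separate])  # [1, 2, 3]
```

The same aliasing happens in the spider: meta={'item': item} passes a reference, not a copy, so every callback scheduled from the loop mutates the same object.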
Upvotes: 3