euri10

Reputation: 2626

scrapy mixing items fields from different pages

Unless I set CONCURRENT_REQUESTS_PER_DOMAIN=1, the items from the pages I crawl end up with attributes taken from other pages.

I suspect this comes from the fact that I generate Requests "by hand" in parse_chapter, but I'm not sure and would like to understand how Scrapy operates.

Here is the relevant portion of the code:

rules = (
    Rule(LxmlLinkExtractor(allow_domains=allowed_domains,
                           restrict_xpaths='.//*[@id="page"]/table[2]/tbody/tr[10]/td[2]/a',
                           process_value=process_links), callback='parse_chapter'),
)


def parse_chapter(self, response):


    item = TogItem()
    item['chaptertitle'] = response.xpath('.//*[@id="chapter_num"]/text()').extract()

    pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])

    for p in range(1, pages + 1):
        page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')
        print("page_absolute_url: {}".format(page_absolute_url))
        yield Request(page_absolute_url, meta={'item': item}, callback=self.parse_pages, dont_filter=True)

def parse_pages(self, response):

    item = response.request.meta['item']
    item['pagenumber'] = response.xpath('.//*[@id="chapter_page"]/text()').extract()
    print(item['pagenumber'])
    images = response.xpath('//*[@id="image"]/@src')
    images_absolute_url = []
    for ie in images:
        print("ie.extract(): {}".format(ie.extract()))
        images_absolute_url.append(urlparse.urljoin(response.url, ie.extract().strip()))

    print("images_absolute_url: {}".format(images_absolute_url))

    item['image_urls'] = images_absolute_url
    yield item

Upvotes: 1

Views: 375

Answers (1)

Elias Dorneles

Reputation: 23856

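This happens because you are passing the same item instance (the item = TogItem() you create in parse_chapter) to the requests for all the pages. Every parse_pages callback then writes into that one shared object, so with concurrent requests the fields set for one page can be overwritten by another page before the item has been processed.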
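The effect can be reproduced without Scrapy. The following is only a minimal sketch: plain dicts stand in for TogItem and for the Request objects, and fake_parse_pages is a hypothetical function that mimics the shape of parse_pages from the question.

```python
# Plain dicts stand in for TogItem and scrapy.Request (assumption for illustration).
item = {"chaptertitle": "chapter 1"}

# Every "request" carries a reference to the same dict, like meta={'item': item}.
requests = [{"page": p, "meta": {"item": item}} for p in range(1, 4)]


def fake_parse_pages(request):
    # Hypothetical stand-in for parse_pages: mutates the item found in meta.
    page_item = request["meta"]["item"]
    page_item["pagenumber"] = request["page"]
    return page_item


results = [fake_parse_pages(r) for r in requests]

# All results are the very same object, so only the last write survives.
same_object = all(r is item for r in results)
page_numbers = [r["pagenumber"] for r in results]
print(same_object, page_numbers)
```

Three callbacks ran, but there is still only one item, and it holds whatever the last callback wrote.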

One way to fix this would be to create the item inside the for loop:

def parse_chapter(self, response):
    chaptertitle = response.xpath('.//*[@id="chapter_num"]/text()').extract()
    pages = int(response.xpath('.//*[@id="head"]/span/text()').extract()[0])

    for p in range(1, pages + 1):
        item = TogItem(chaptertitle=chaptertitle)

        page_absolute_url = urlparse.urljoin(response.url, str(p) + '.html')

        yield Request(page_absolute_url, meta={'item': item},
                      callback=self.parse_pages, dont_filter=True)
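If the item has many fields shared across pages, an alternative to rebuilding it in the loop is to fill it once and give each request its own copy. This is a sketch under assumptions, not part of the code above: a plain dict stands in for TogItem, build_page_items is a hypothetical helper, and copy.deepcopy is used so that list-valued fields (such as the result of .extract()) are not shared between copies either.

```python
import copy


def build_page_items(template, pages):
    # Hypothetical helper: return one independent deep copy of the template per page.
    return [copy.deepcopy(template) for _ in range(pages)]


# Plain dict standing in for a TogItem with the chapter-level fields filled in.
template = {"chaptertitle": ["chapter 1"]}

items = build_page_items(template, 3)
for number, page_item in enumerate(items, start=1):
    # Writing to one copy does not touch the template or the other copies.
    page_item["pagenumber"] = number

page_numbers = [page_item["pagenumber"] for page_item in items]
print(page_numbers, "pagenumber" in template)
```

In the spider, this would correspond to passing a copy of the item in meta for each Request instead of the shared instance.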

Upvotes: 3
