Scrapy iterating over list of elements on page

Question

I'm having issues with my Scrapy project. I want to collect all ads on the page into a list and then iterate over that list to extract and save data for each ad. I'm sure I'm doing something wrong, but I can't tell what. I suspect the problem is with the .extract_first() call, but I'm calling it on a single object in the list, not on the whole response. Right now the spider only extracts the first value on the page that matches each XPath. Here is the code:

class OddajastanovanjeljmestoSpider(scrapy.Spider):
    name = 'OddajaStanovanjeLjMesto'
    allowed_domains = ['www.nepremicnine.net']
    start_urls = ['https://www.nepremicnine.net/oglasi-oddaja/ljubljana-mesto/stanovanje/']

    def parse(self, response):
        oglasi = response.xpath('//div[@itemprop="item"]')
        for oglas in oglasi:
            item = NepremicninenetItem()
            item['velikost'] = oglas.xpath('//div[@class="main-data"]/span[@class="velikost"]/text()').extract_first(default="NaN")
            item['leto'] = oglas.xpath('//div[@class="atributi"]/span[@class="atribut leto"]/strong/text()').extract_first(default="NaN")
            item['zemljisce'] = oglas.xpath('//div[@class="atributi"]/span[@class="atribut"][text()="Zemljišče: "]/strong/text()').extract_first(default="NaN")

            request = scrapy.Request("https://www.nepremicnine.net" + response.xpath('//div[@itemprop="item"]/h2[@itemprop="name"]/a[@itemprop="url"]/@href').extract_first(), callback=self.parse_item_page)
            request.meta['item'] = item

            yield request

        next_page_url = response.xpath('//div[@id="pagination"]//a[@class="next"]/@href').extract_first()
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(absolute_next_page_url)

    def parse_item_page(self, response):
        item = response.meta['item']

        item['referencnaStevilka'] = response.xpath('//div[@id="opis"]/div[@class="dsc"][preceding-sibling::div[@class="lbl"][text()="Referenčna št.:"]]/strong/text()').extract_first(default="NaN")
        item['tipOglasa'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="1"]]/@title').extract_first(default="NaN")
        item['cena'] = response.xpath('//div[@class="galerija-container"]/meta[@itemprop="price"]/@content').extract_first(default="NaN")
        item['valuta'] = response.xpath('//div[@class="galerija-container"]/meta[@itemprop="priceCurrency"]/@content').extract_first(default="NaN")
        item['vrstaNepremicnine'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="5"]]/@title').extract_first(default="NaN")
        item['tipNepremicnine'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="6"]]/@title').extract_first(default="NaN")
        item['regija'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="2"]]/@title').extract_first(default="NaN")
        item['upravnaEnota'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="3"]]/@title').extract_first(default="NaN")
        item['obcina'] = response.xpath('//li[@itemprop="itemListElement"]/a[../meta[@content="4"]]/@title').extract_first(default="NaN")
        item['prodajalec'] = response.xpath('//div[@itemprop="seller"]/meta[@itemprop="name"]/@content').extract_first(default="NaN")

        yield item

The parse_item_page method works correctly and returns the appropriate data, but the parse method just returns the first matching data it finds on the page.

Answers (1)
