Bert Carremans

Reputation: 1733

For loop in Scrapy returns full response multiple times

I am trying to scrape a webpage containing a TV guide (movies with their channel and start time). The structure of the webpage looks like this:

<div class="grid__col__inner">
    <div class="tv-guide__channel">
        <h6>
            <a href="./tv-gids/2be/vandaag">2BE</a>
        </h6>
    </div>
    <div class="program">
        <div class="time">22:20</div>
        <div class="title"><a href="./2be/vandaag/knowing">Knowing</a></div>
    </div>
</div>

The webpage has multiple grid__col__inner divs, one for each channel, and each channel can contain multiple movies.

I wrote a spider with the Scrapy framework as follows:

    def parse(self, response):
        for col_inner in response.xpath('//div[@class="grid__col__inner"]'):
            chnl = col_inner.xpath('//div[@class="tv-guide__channel"]/h6/a/text()').extract()
            for program in col_inner.xpath('//div[@class="program"]'):
                item = TVGuideItem()
                item['channel'] = chnl
                item['start_ts'] = program.xpath('//div[@class="time"]/text()').extract()
                item['title'] = program.xpath('//div[@class="title"]/a/text()').extract()
                yield item

Because the channel name is only mentioned once in the grid__col__inner div, I extract it first and assign it to each item (movie).

When I run this code, it returns the full result (all channels with all movies) for each grid__col__inner div instead of only that channel's movies. Below is the item produced by one pass of the for-loop; the same item is returned multiple times.

{'channel': [u'VTM', u'VITAYA', u'PRIME STAR', u'PRIME ACTION', u'PRIME FAMILY', u'PRIME FEZZTIVAL', u'NPO3'], 'start_ts': [u'22:30', u'13:35', u'20:35', u'06:30', u'08:00', u'09:40', u'11:00'], 'title': [u'Another 48 Hrs', u'Double Bill', u'Man zkt Vrouw', u'82 dagen in april', u'Rio 2', u'Epizoda u zivotu beraca zeljeza', u'300: Rise of an Empire']}

Am I doing something wrong with the for-loop here?

Upvotes: 0

Views: 602

Answers (1)

bertucho

Reputation: 26

See the Scrapy documentation on working with relative XPaths: http://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths

When you write:

chnl = col_inner.xpath('//div[@class="tv-guide__channel"]/h6/a/text()').extract()

You are extracting every //div[@class="tv-guide__channel"] element in the document, because an XPath that starts with // is absolute and searches the whole document, even when it is called on a nested selector such as col_inner. Instead, try this:

chnl = col_inner.xpath('.//div[@class="tv-guide__channel"]/h6/a/text()').extract()

The leading dot in .// makes the search relative to the current node. You have to do the same with the rest of the selectors:

    def parse(self, response):
        for col_inner in response.xpath('//div[@class="grid__col__inner"]'):
            # './/' keeps each query inside the current grid__col__inner div
            chnl = col_inner.xpath('.//div[@class="tv-guide__channel"]/h6/a/text()').extract()
            for program in col_inner.xpath('.//div[@class="program"]'):
                item = TVGuideItem()
                item['channel'] = chnl
                # './/' here is relative to the current program div
                item['start_ts'] = program.xpath('.//div[@class="time"]/text()').extract()
                item['title'] = program.xpath('.//div[@class="title"]/a/text()').extract()
                yield item


Upvotes: 1
