Reputation: 1733
I am trying to scrape a webpage containing a TV guide (movies with their channel and start time). The structure of the webpage looks like this:
<div class="grid__col__inner">
    <div class="tv-guide__channel">
        <h6>
            <a href="./tv-gids/2be/vandaag">2BE</a>
        </h6>
    </div>
    <div class="program">
        <div class="time">22:20</div>
        <div class="title"><a href="./2be/vandaag/knowing">Knowing</a></div>
    </div>
</div>
The webpage has multiple grid__col__inner divs, one for each channel. Each channel can contain multiple movies.
I wrote a spider with the Scrapy framework as follows:
def parse(self, response):
    for col_inner in response.xpath('//div[@class="grid__col__inner"]'):
        chnl = col_inner.xpath('//div[@class="tv-guide__channel"]/h6/a/text()').extract()
        for program in col_inner.xpath('//div[@class="program"]'):
            item = TVGuideItem()
            item['channel'] = chnl
            item['start_ts'] = program.xpath('//div[@class="time"]/text()').extract()
            item['title'] = program.xpath('//div[@class="title"]/a/text()').extract()
            yield item
Because the channel name is only mentioned once in the grid__col__inner div, I extract it first and assign it to each item (movie).
When I run this code, it returns the full result (all channels with all movies) for each grid__col__inner. Below you see the result of one run of the for-loop. When I run it, it returns this same result multiple times.
{'channel': [u'VTM', u'VITAYA', u'PRIME STAR', u'PRIME ACTION', u'PRIME FAMILY', u'PRIME FEZZTIVAL', u'NPO3'], 'start_ts': [u'22:30', u'13:35', u'20:35', u'06:30', u'08:00', u'09:40', u'11:00'], 'title': [u'Another 48 Hrs', u'Double Bill', u'Man zkt Vrouw', u'82 dagen in april', u'Rio 2', u'Epizoda u zivotu beraca zeljeza', u'300: Rise of an Empire']}
Am I doing something wrong with the for-loop here?
Upvotes: 0
Views: 602
Reputation: 26
See the Scrapy documentation on working with relative XPaths: http://doc.scrapy.org/en/latest/topics/selectors.html#working-with-relative-xpaths
When you do:
chnl = col_inner.xpath('//div[@class="tv-guide__channel"]/h6/a/text()').extract()
you extract every //div[@class="tv-guide__channel"] element in the document, because a path that starts with // searches the whole document, even when it is called on col_inner. Instead try this:
chnl = col_inner.xpath('.//div[@class="tv-guide__channel"]/h6/a/text()').extract()
The leading . in .// makes the search relative to the current node. You have to do the same with the rest of the selectors:
def parse(self, response):
    for col_inner in response.xpath('//div[@class="grid__col__inner"]'):
        chnl = col_inner.xpath('.//div[@class="tv-guide__channel"]/h6/a/text()').extract()
        for program in col_inner.xpath('.//div[@class="program"]'):
            item = TVGuideItem()
            item['channel'] = chnl
            item['start_ts'] = program.xpath('.//div[@class="time"]/text()').extract()
            item['title'] = program.xpath('.//div[@class="title"]/a/text()').extract()
            yield item
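If you want to see the effect without running a spider, here is a minimal sketch using only Python's standard library. It is not Scrapy code: it uses xml.etree.ElementTree (which supports only a subset of XPath) as a stand-in for Scrapy's selectors. The second column in PAGE and the parse_guide helper are made up for illustration. It contrasts a search started from the document root with searches started from each column and each program.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the structure from the question. The second column is
# made up for illustration, reusing values that appear in the question's output.
PAGE = """
<body>
  <div class="grid__col__inner">
    <div class="tv-guide__channel"><h6><a href="./tv-gids/2be/vandaag">2BE</a></h6></div>
    <div class="program">
      <div class="time">22:20</div>
      <div class="title"><a href="./2be/vandaag/knowing">Knowing</a></div>
    </div>
  </div>
  <div class="grid__col__inner">
    <div class="tv-guide__channel"><h6><a>VTM</a></h6></div>
    <div class="program">
      <div class="time">22:30</div>
      <div class="title"><a>Another 48 Hrs</a></div>
    </div>
  </div>
</body>
""".strip()

root = ET.fromstring(PAGE)

# Searching from the document root matches the times of every channel.
# This is the kind of result the original spider got on each iteration.
all_times = [el.text for el in root.findall('.//div[@class="time"]')]


def parse_guide(root):
    """Hypothetical helper: one dict per program, scoped to its own column."""
    rows = []
    for col in root.findall('.//div[@class="grid__col__inner"]'):
        # Relative to col: only this column's channel name is found.
        channel = col.findtext('.//div[@class="tv-guide__channel"]/h6/a')
        for program in col.findall('.//div[@class="program"]'):
            rows.append({
                'channel': channel,
                # Relative to program: only this program's time and title.
                'start_ts': program.findtext('.//div[@class="time"]'),
                'title': program.findtext('.//div[@class="title"]/a'),
            })
    return rows


rows = parse_guide(root)
for row in rows:
    print(row)
```

Each printed row holds a single channel, time and title, whereas all_times contains the times of both columns at once.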
Upvotes: 1