Reputation: 13
I've been running into an issue with a spider I've put together. I am trying to scrape individual lines from the transcript on this site, and have found some appropriate selectors, but when run, the spider's output is simply the same line repeated over and over. I've seen a couple others with similar issues (like this), but haven't yet found an answer that solves my problem.
(As a note, I believe this may be an issue with my base Python coding and for loop building, as opposed to an issue with scrapy itself.)
Here is the spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TalSpider(CrawlSpider):
    name = 'tal'
    allowed_domains = ['https://www.thisamericanlife.org/radio-archives/episode/']
    start_urls = ['https://www.thisamericanlife.org/radio-archives/episode/1/transcript/']

    def parse(self, response):
        for line in response.xpath('//div'):
            episode_num_text = line.xpath('//div[contains(@class, "radio-wrapper")]/@id').extract()
            radio_date_text = line.xpath('//div[contains(@class, "radio-date")]/text()').extract()
            episode_title = line.xpath('//h2').xpath('a[contains(@href, *)]/text()').extract()
            begin_timestamp = line.xpath('//p[contains(@begin, *)]/@begin').extract()
            speaker_class = line.xpath('//div/@class').extract()
            speaker_name = line.xpath('//h4/text()').extract()
            line_text = line.xpath('//p[contains(@begin, *)]/text()').extract()
            full_audio_link = line.xpath('//p[contains(@class, "full-audio")]/text()').extract()

            for item in zip(episode_num_text, radio_date_text, episode_title, begin_timestamp, speaker_class, speaker_name, line_text, full_audio_link):
                scraped_info = {
                    'episode_num_text': item[0],
                    'radio_date_text': item[1],
                    'episode_title': item[2],
                    'begin_timestamp': item[3],
                    'speaker_class': item[4],
                    'speaker_name': item[5],
                    'line_text': item[6],
                    'full_audio_link': item[7],
                }
                yield scraped_info
And here is a screen grab of the .csv output which shows the repeated output.
The issue seems to lie in the for loop. My thinking is this: for each Selector in this list of Selectors, pull a subset of that element as defined by the items in the for loop. Instead, it seems to be executing: for each of the 177 Selectors in this list, return the first element of each of the items defined.
I'm happy to clarify any of these issues, and would greatly appreciate any help anyone can offer!
Upvotes: 1
Views: 577
Reputation: 2011
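Be aware of the difference between absolute and relative XPath in Scrapy.
In parse, you loop over the elements selected by response.xpath('//div'), but inside the loop every expression still begins with //. An expression that starts with // is absolute: it searches the whole document no matter which selector you call it on, so every iteration returns the same results. Inside the loop the expressions should be relative to the current element, i.e. start with .// (for example, line.xpath('.//h4/text()')).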
Thanks.
Upvotes: 4