Chris Jewell

Reputation: 13

Scrapy Spider Returning Same Elements Over and Over

I've been running into an issue with a spider I've put together. I am trying to scrape individual lines from the transcript on this site, and have found some appropriate selectors, but when run, the spider's output is simply the same line repeated over and over. I've seen a couple others with similar issues (like this), but haven't yet found an answer that solves my problem.

(As a note, I believe this may be an issue with my base Python coding and for loop building, as opposed to an issue with scrapy itself.)

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TalSpider(CrawlSpider):
    name = 'tal'
    allowed_domains = ['https://www.thisamericanlife.org/radio-archives/episode/']
    start_urls = ['https://www.thisamericanlife.org/radio-archives/episode/1/transcript/']

def parse(self, response):

    for line in response.xpath('//div'):
        episode_num_text = line.xpath('//div[contains(@class, "radio-wrapper")]/@id').extract()
        radio_date_text = line.xpath('//div[contains(@class, "radio-date")]/text()').extract()
        episode_title = line.xpath('//h2').xpath('a[contains(@href, *)]/text()').extract()
        begin_timestamp = line.xpath('//p[contains(@begin, *)]/@begin').extract()
        speaker_class = line.xpath('//div/@class').extract()
        speaker_name = line.xpath('//h4/text()').extract()
        line_text = line.xpath('//p[contains(@begin, *)]/text()').extract()
        full_audio_link = line.xpath('//p[contains(@class, "full-audio")]/text()').extract()



        for item in zip(episode_num_text, radio_date_text, episode_title, begin_timestamp, speaker_class, speaker_name, line_text, full_audio_link):
            scraped_info = {
                'episode_num_text' : item[0], 
                'radio_date_text' : item[1], 
                'episode_title' : item[2],
                'begin_timestamp' : item[3], 
                'speaker_class' : item[4],
                'speaker_name' : item[5], 
                'line_text' : item[6], 
                'full_audio_link' : item[7],
                }
            yield scraped_info

And here is a screen grab of the .csv output which shows the repeated output.

The issue seems to lie in the for loop. My thinking is this: for each Selector in this list of Selectors, pull a subset of that element as defined by the items in the for loop. Instead, it seems to be executing: for each of the 177 Selectors in this list, return the first element of each of the items defined.

I'm happy to clarify any of these issues, and would greatly appreciate any help anyone can offer!

Upvotes: 1

Views: 577

Answers (1)

rojeeer

Reputation: 2011

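Be aware of the difference between absolute and relative XPath in Scrapy.

In parse, you loop over the elements selected by an absolute XPath ('//div'). Inside the loop, however, every expression still starts with //, which is also absolute: it searches the whole document rather than the current line element, so each iteration returns the same results. The inner expressions should be relative instead. Prefix them with a dot (for example, './/h4/text()') so they are evaluated against the element you are looping over.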

Thanks.

Upvotes: 4
