Chris Jewell

Reputation: 13

Scrapy Spider returning only last element when given a list of Selectors

I've run into an issue with a spider I've put together. I am trying to scrape each line of text, along with its timestamp, from the transcript on this site. I have found what I believe are the appropriate selectors, but when I run the spider, its output is just the last line and timestamp. I've seen a couple of other people with similar issues, but haven't yet found an answer that solves my problem.

Here is the spider:

# -*- coding: utf-8 -*-
import scrapy
from this_american_life.items import TalTranscriptItem

class CrawlSpider(scrapy.Spider):
    name = "transcript2"
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"]
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item

And here is the code for TalTranscriptItem() in items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TalTranscriptItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    episode_id = scrapy.Field()
    episode_num_text = scrapy.Field()
    year = scrapy.Field()
    radio_date_text = scrapy.Field()
    radio_date_datetime = scrapy.Field()
    episode_title = scrapy.Field()
    episode_hosts = scrapy.Field()
    act_id = scrapy.Field()
    line_id = scrapy.Field()
    begin_timestamp = scrapy.Field()
    speaker_class = scrapy.Field()
    speaker_name = scrapy.Field()
    line_text = scrapy.Field()
    full_audio_link = scrapy.Field()
    transcript_url = scrapy.Field()

When I run the selectors in the Scrapy shell, they appear to work correctly (returning all of the lines of text), but for some reason I haven't been able to get the same result from the spider.

I'm happy to clarify any of these issues, and would greatly appreciate any help anyone can offer!

Upvotes: 0

Views: 867

Answers (2)

Wilfredo

Reputation: 1548

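If you want each line yielded as its own item, I think this is what you want. Notice that the item is created inside the loop, that the yield line is indented so it runs once per line, and that the XPath expressions start with ./ so they are evaluated relative to the current p element rather than the whole document: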

for line in response.css('p'):
    item = TalTranscriptItem()
    item['begin_timestamp'] = line.xpath('./@begin').extract_first()
    item['line_text'] = line.xpath('./text()').extract_first()
    yield item
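
To see why the original spider produced only one item, here is a minimal sketch that needs no Scrapy at all. It assumes plain dicts as a stand-in for TalTranscriptItem and a hypothetical list of (timestamp, text) tuples as a stand-in for the selected p elements; the function names and sample rows are made up for illustration.

```python
# Hypothetical sample data standing in for the <p> elements of the transcript.
rows = [
    ("00:00:01", "first line"),
    ("00:00:05", "second line"),
    ("00:00:09", "third line"),
]


def parse_buggy(rows):
    """Mirrors the question: one shared item, yield placed after the loop."""
    item = {}
    for begin, text in rows:
        # Each pass overwrites the values stored by the previous pass.
        item["begin_timestamp"] = begin
        item["line_text"] = text
    # Runs once, after the loop has finished, so only the last row survives.
    yield item


def parse_fixed(rows):
    """Mirrors the answer: a fresh item per row, yield inside the loop."""
    for begin, text in rows:
        item = {}
        item["begin_timestamp"] = begin
        item["line_text"] = text
        yield item


buggy_items = list(parse_buggy(rows))
fixed_items = list(parse_fixed(rows))

print(len(buggy_items), buggy_items)
print(len(fixed_items), fixed_items)
```

The buggy generator yields a single item holding the last row, while the fixed one yields one item per row. Creating the item inside the loop also matters: if one shared item were yielded repeatedly, every yielded reference would point at the same object.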

Upvotes: 1

Wandrille

Reputation: 6811

I don't know what item is, but you can collect one dict per line into a list:

item = []

for line in response.xpath('//p'):
    # './' keeps each query relative to the current <p>; '//' would search the whole document
    dictItem = {
        'begin_timestamp': line.xpath('./@begin').extract(),
        'line_text': line.xpath('./text()').extract(),
    }
    item.append(dictItem)

print(item)

Upvotes: 0
