Reputation: 13
I've been running into an issue with a spider I've put together. I am trying to scrape individual lines of text, along with their corresponding timestamps, from the transcript on this site, and have found what I believe are the appropriate selectors, but when run, the spider outputs only the last line and timestamp. I've seen a couple of other questions about similar issues, but haven't yet found an answer that solves my problem.
Here is the spider:
# -*- coding: utf-8 -*-
import scrapy
from this_american_life.items import TalTranscriptItem

class CrawlSpider(scrapy.Spider):
    name = "transcript2"
    allowed_domains = ["https://www.thisamericanlife.org/radio-archives/episode/1/transcript"]
    start_urls = (
        'https://www.thisamericanlife.org/radio-archives/episode/1/transcript',
    )

    def parse(self, response):
        item = TalTranscriptItem()
        for line in response.xpath('//p'):
            item['begin_timestamp'] = line.xpath('//@begin').extract()
            item['line_text'] = line.xpath('//text()').extract()
        yield item
And here is the code for TalTranscriptItem() in items.py:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class TalTranscriptItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    episode_id = scrapy.Field()
    episode_num_text = scrapy.Field()
    year = scrapy.Field()
    radio_date_text = scrapy.Field()
    radio_date_datetime = scrapy.Field()
    episode_title = scrapy.Field()
    episode_hosts = scrapy.Field()
    act_id = scrapy.Field()
    line_id = scrapy.Field()
    begin_timestamp = scrapy.Field()
    speaker_class = scrapy.Field()
    speaker_name = scrapy.Field()
    line_text = scrapy.Field()
    full_audio_link = scrapy.Field()
    transcript_url = scrapy.Field()
When run in the scrapy shell, it appears to work correctly (drawing all of the lines of text), but for some reason I haven't been able to get it to work in the spider.
I'm happy to clarify any of these issues, and would greatly appreciate any help anyone can offer!
Upvotes: 0
Views: 867
Reputation: 1548
If you want each individual line yielded as its own item, I think this is what you want (notice the indentation of the final yield line, which is now inside the loop, and the relative ./ paths):
for line in response.css('p'):
    item = TalTranscriptItem()
    item['begin_timestamp'] = line.xpath('./@begin').extract_first()
    item['line_text'] = line.xpath('./text()').extract_first()
    yield item
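The effect of where the yield sits can be reproduced without Scrapy at all. Below is a minimal sketch, assuming the transcript rows are already available as (timestamp, text) tuples. The names `ROWS`, `parse_buggy`, and `parse_fixed` and the sample values are made up for illustration and are not taken from the real page:

```python
# Placeholder rows standing in for the <p begin="..."> elements (made-up data).
ROWS = [
    ("00:00:00", "first line"),
    ("00:00:01", "second line"),
    ("00:00:02", "third line"),
]


def parse_buggy(rows):
    """One shared item, with the yield outside the loop."""
    item = {}
    for begin, text in rows:
        # Each pass overwrites the same two keys on the same dict.
        item["begin_timestamp"] = begin
        item["line_text"] = text
    # Reached once, after the loop ends, so only the last row survives.
    yield item


def parse_fixed(rows):
    """A fresh item per row, with the yield inside the loop."""
    for begin, text in rows:
        item = {"begin_timestamp": begin, "line_text": text}
        yield item
```

Calling `list(parse_buggy(ROWS))` gives a single dict holding the last row, while `list(parse_fixed(ROWS))` gives one dict per row. Creating the item inside the loop matters as well: if one object were reused and yielded on every pass, every yielded reference would point at the same object.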
Upvotes: 1
Reputation: 6811
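I don't know what item is, but you can collect plain dicts in a list instead:
item = []
for line in response.xpath('//p'):
    # Relative paths ('./', './/') keep each query scoped to this <p>;
    # a leading '//' would search the whole document again on every pass.
    dictItem = {
        'begin_timestamp': line.xpath('./@begin').extract(),
        'line_text': line.xpath('.//text()').extract(),
    }
    item.append(dictItem)
print(item)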
Upvotes: 0