MrRobot9

Reputation: 2684

Scrapy: Extract links

I am new to Scrapy and am trying to extract subtitles from www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014

This is my spider file, scrape.py:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

ss_base_url = "https://www.springfieldspringfield.co.uk/episode_scripts.php"

class Script(Item):
    url = Field()
    episode_name = Field()
    script = Field()

class SubtitleSpider(CrawlSpider):
    name = "scrape"
    allowed_domains = ['www.springfieldspringfield.co.uk']
    start_urls = [ss_base_url]
    rules = (
        Rule(LinkExtractor(allow=['/episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+']),
             callback="parse_script",
             follow=True),)

    def fix_field_names(self, field_name):
        field_name = re.sub(" ","_", field_name)
        field_name = re.sub(":","", field_name)
        return field_name

    def parse_script(self, response):
        x = HtmlXPathSelector(response)
        script = Script()
        script['url'] = response.url
        script['episode_name'] = "".join(x.select("//h3/text()").extract())
        script['script'] = "\n".join(x.select("//div[@class='episode_script']/text()").extract())
        return script

I am trying to extract the subtitles for all seasons from https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014

The subtitles are on pages like these:

https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e01

https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e02

When I run

 scrapy crawl --nolog scrape

I expect to get the links above as output, but nothing is returned. Where am I going wrong?

Upvotes: 0

Views: 321

Answers (1)

Lasse Sviland

Reputation: 1517

Your regular expression for matching the links contains a question mark, which needs to be escaped for the match to work. The episode pages also live under view_episode_scripts.php rather than episode_scripts.php. It should work if you change it to this:

'\/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+'

Running the spider with --nolog suppresses the log output, so the scraped links will not be shown. You need to remove that flag as well.
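To see why the escaping matters, here is a minimal sketch that tests both patterns with Python's re module directly, outside Scrapy. It assumes the allow patterns are applied to the URL as ordinary regular-expression searches. The variable names are only for illustration, and the unescaped pattern uses the view_episode_scripts.php path so that the question mark is the only difference.

```python
import re

# One of the episode URLs from the question.
url = ("https://www.springfieldspringfield.co.uk/view_episode_scripts.php"
       "?tv-show=bojack-horseman-2014&episode=s01e01")

# Unescaped "?" makes the preceding "p" optional instead of matching a literal "?".
unescaped = r"/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+"

# Escaped "\?" matches the literal question mark that starts the query string.
escaped = r"/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+"

unescaped_match = re.search(unescaped, url)
escaped_match = re.search(escaped, url)

print("unescaped:", unescaped_match)
print("escaped:  ", escaped_match.group(0) if escaped_match else None)
```

In the unescaped pattern, "php?" means "ph" followed by an optional "p", so the regex then expects "tv-show" immediately and never gets past the literal "?" in the URL.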

Upvotes: 1
