Reputation: 2684
I am new to Scrapy and am trying to extract subtitles from www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014
This is my scrape.py, which is the spider file:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re

ss_base_url = "https://www.springfieldspringfield.co.uk/episode_scripts.php"


class Script(Item):
    url = Field()
    episode_name = Field()
    script = Field()


class SubtitleSpider(CrawlSpider):
    name = "scrape"
    allowed_domains = ['www.springfieldspringfield.co.uk']
    start_urls = [ss_base_url]
    rules = (
        Rule(LinkExtractor(allow=['/episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+']),
             callback="parse_script",
             follow=True),)

    def fix_field_names(self, field_name):
        field_name = re.sub(" ", "_", field_name)
        field_name = re.sub(":", "", field_name)
        return field_name

    def parse_script(self, response):
        x = HtmlXPathSelector(response)
        script = Script()
        script['url'] = response.url
        script['episode_name'] = "".join(x.select("//h3/text()").extract())
        script['script'] = "\n".join(x.select("//div[@class='episode_script']/text()").extract())
        return script
I am trying to extract the subtitles for all seasons from https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014
The subtitles are on the pages behind links like these:
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e01
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e02
When I run
scrapy crawl --nolog scrape
I expect to get the links above as output, but it returns nothing. Where am I going wrong?
Upvotes: 0
Views: 321
Reputation: 1517
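Your regular expression for matching the links contains a question mark, which needs to be escaped to match a literal "?"; left unescaped, it only makes the preceding "p" optional. The episode pages are also served from view_episode_scripts.php rather than episode_scripts.php. It should work if you change the pattern to this: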
'\/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+'
Running the spider with --nolog suppresses the log output, so the scraped links will not be shown; you need to remove that flag as well.
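To see the effect of the escape without running a crawl, here is a minimal sketch using only Python's re module. It assumes the link extractor tests each URL with search-style matching; the helper name matches and the variable names are made up for illustration, and the escaped pattern is written as a raw string with the unnecessary backslash before "/" dropped.

```python
import re

# Pattern as written in the question: the "?" is unescaped, so it makes the
# preceding "p" optional instead of matching a literal question mark.
unescaped = r"/episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+"

# Pattern from the answer, as a raw string: "." and "?" are escaped and the
# path is view_episode_scripts.php.
escaped = r"/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+"

# One of the episode URLs listed in the question.
url = ("https://www.springfieldspringfield.co.uk/view_episode_scripts.php"
       "?tv-show=bojack-horseman-2014&episode=s01e01")

# The question's pattern read as plain text (illustrative only).
literal = "/episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e01"


def matches(pattern, text):
    """Hypothetical helper: True if the pattern is found anywhere in text."""
    return re.search(pattern, text) is not None


print(matches(unescaped, url))      # False
print(matches(unescaped, literal))  # False: "?" is not matched literally
print(matches(escaped, url))        # True
```

The second check shows that the unescaped pattern does not even match its own literal text, which is why no links were followed.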
Upvotes: 1