Reputation: 2393
I found that my CrawlSpider
only crawls start_urls
, and not going any further.
The following is my code.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
class ExampleSpider(CrawlSpider):
name = 'example'
allowed_domains = ['holy-bible-eng']
start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml']
rules = (
Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item', follow=True),
)
def parse_item(self, response):
return response
Below is my file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml
in start_urls
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Holy Bible</title><link href="lds_ePub_scriptures.css" rel="stylesheet" type="text/css" /></head><body class="bible-toc"><div class="titleBlock"><h1 class="toc-title">The Names and Order of All the <br /><span class="dominant">Books of the Old and <br />New Testaments</span></h1></div><div class="bible-toc"><p><a href="bible_dedication.xhtml">Epistle Dedicatory</a> | <a href="quad_abbreviations.xhtml">Abbreviations</a></p><h2 class="toc-title"><a href="ot.xhtml">The Books of the Old Testament</a></h2><p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a> | <a href="lev.xhtml">Leviticus</a> | <a href="num.xhtml">Numbers</a> | <a href="deut.xhtml">Deuteronomy</a> | <a href="josh.xhtml">Joshua</a> | <a href="judg.xhtml">Judges</a> | <a href="ruth.xhtml">Ruth</a> | <a href="1-sam.xhtml">1 Samuel</a> | <a href="2-sam.xhtml">2 Samuel</a> | <a href="1-kgs.xhtml">1 Kings</a> | <a href="2-kgs.xhtml">2 Kings</a> | <a href="1-chr.xhtml">1 Chronicles</a> | <a href="2-chr.xhtml">2 Chronicles</a> | <a href="ezra.xhtml">Ezra</a> | <a href="neh.xhtml">Nehemiah</a> | <a href="esth.xhtml">Esther</a> | <a href="job.xhtml">Job</a> | <a href="ps.xhtml">Psalms</a> | <a href="prov.xhtml">Proverbs</a> | <a href="eccl.xhtml">Ecclesiastes</a> | <a href="song.xhtml">Song of Solomon</a> | <a href="isa.xhtml">Isaiah</a> | <a href="jer.xhtml">Jeremiah</a> | <a href="lam.xhtml">Lamentations</a> | <a href="ezek.xhtml">Ezekiel</a> | <a href="dan.xhtml">Daniel</a> | <a href="hosea.xhtml">Hosea</a> | <a href="joel.xhtml">Joel</a> | <a href="amos.xhtml">Amos</a> | <a href="obad.xhtml">Obadiah</a> | <a href="jonah.xhtml">Jonah</a> | <a href="micah.xhtml">Micah</a> | <a href="nahum.xhtml">Nahum</a> | <a href="hab.xhtml">Habakkuk</a> | <a href="zeph.xhtml">Zephaniah</a> | <a href="hag.xhtml">Haggai</a> | <a href="zech.xhtml">Zechariah</a> | <a href="mal.xhtml">Malachi</a></p><h2 class="toc-title"><a href="nt.xhtml">The Books of the New Testament</a></h2><p><a href="matt.xhtml">Matthew</a> | <a href="mark.xhtml">Mark</a> | <a href="luke.xhtml">Luke</a> | <a href="john.xhtml">John</a> | <a href="acts.xhtml">Acts</a> | <a href="rom.xhtml">Romans</a> | <a href="1-cor.xhtml">1 Corinthians</a> | <a href="2-cor.xhtml">2 Corinthians</a> | <a href="gal.xhtml">Galatians</a> | <a href="eph.xhtml">Ephesians</a> | <a href="philip.xhtml">Philippians</a> | <a href="col.xhtml">Colossians</a> | <a href="1-thes.xhtml">1 Thessalonians</a> | <a href="2-thes.xhtml">2 Thessalonians</a> | <a href="1-tim.xhtml">1 Timothy</a> | <a href="2-tim.xhtml">2 Timothy</a> | <a href="titus.xhtml">Titus</a> | <a href="philem.xhtml">Philemon</a> | <a href="heb.xhtml">Hebrews</a> | <a href="james.xhtml">James</a> | <a href="1-pet.xhtml">1 Peter</a> | <a href="2-pet.xhtml">2 Peter</a> | <a href="1-jn.xhtml">1 John</a> | <a href="2-jn.xhtml">2 John</a> | <a href="3-jn.xhtml">3 John</a> | <a href="jude.xhtml">Jude</a> | <a href="rev.xhtml">Revelation</a></p><h2 class="toc-title"><a href="bible-helps_title-page.xhtml">Appendix</a></h2><p><a href="tg.xhtml">Topical Guide</a> | <a href="bd.xhtml">Bible Dictionary</a> | <a href="bible-chron.xhtml">Bible Chronology</a> | <a href="harmony.xhtml">Harmony of the Gospels</a> | <a href="jst.xhtml">Joseph Smith Translation</a> | <a href="bible-maps.xhtml">Bible Maps</a> | <a href="bible-photos.xhtml">Bible Photographs</a></p></div></body></html>
And the below is my console output.
(crawl) G:\kjvbible>scrapy crawl example
......
......
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Spider opened
2017-04-08 09:24:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-08 09:24:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-04-08 09:24:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml> (referer: None)
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-08 09:24:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 237,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 3693,
It doesn't go any deeper.
Any suggestions would be welcome.
Upvotes: 1
Views: 159
Reputation: 18799
from CrawlSpider
documentation:
follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False
you cannot have a rule with callback
and follow=True
at the same time. It will only listen to the callback, and it won't go further.
So the main idea behind CrawlSpider
's rules is that it can find links to follow and links to actually extract.
Now scrapy
isn't the best idea to check your "local" files, for that just create a simple script.
Another error is that you are setting the allowed_domains
class variable, which specifies which domains it should accept. All the others are rejected, and this only works for links on the internet. Remove that variable if you don't want to reject domains, or if you are not using domains at all (your case).
Upvotes: 2