Scrapy CrawlSpider only touches start_urls

Question

I found that my CrawlSpider only crawls start_urls and does not go any further.

The following is my code.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['holy-bible-eng']
    start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml']

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        return response

Below is the page at file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml that is listed in start_urls:



Holy Bible
The Names and Order of All the Books of the Old and New Testaments
Epistle Dedicatory | Abbreviations
The Books of the Old Testament
Genesis | Exodus | Leviticus | Numbers | Deuteronomy | Joshua | Judges | Ruth | 1 Samuel | 2 Samuel | 1 Kings | 2 Kings | 1 Chronicles | 2 Chronicles | Ezra | Nehemiah | Esther | Job | Psalms | Proverbs | Ecclesiastes | Song of Solomon | Isaiah | Jeremiah | Lamentations | Ezekiel | Daniel | Hosea | Joel | Amos | Obadiah | Jonah | Micah | Nahum | Habakkuk | Zephaniah | Haggai | Zechariah | Malachi
The Books of the New Testament
Matthew | Mark | Luke | John | Acts | Romans | 1 Corinthians | 2 Corinthians | Galatians | Ephesians | Philippians | Colossians | 1 Thessalonians | 2 Thessalonians | 1 Timothy | 2 Timothy | Titus | Philemon | Hebrews | James | 1 Peter | 2 Peter | 1 John | 2 John | 3 John | Jude | Revelation
Appendix
Topical Guide | Bible Dictionary | Bible Chronology | Harmony of the Gospels | Joseph Smith Translation | Bible Maps | Bible Photographs

And below is my console output.

(crawl) G:\kjvbible>scrapy crawl example
......
......

2017-04-08 09:24:59 [scrapy.core.engine] INFO: Spider opened
2017-04-08 09:24:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-04-08 09:24:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026
2017-04-08 09:24:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml> (referer: None)
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-08 09:24:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 237,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 3693,

It doesn't go any deeper.

Any suggestions would be welcome.

eLRuLL · Accepted Answer

From the CrawlSpider documentation:

follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None follow defaults to True, otherwise it defaults to False
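
The default described in that quote can be sketched in plain Python. This is only an illustration of the documented behaviour, not Scrapy's source code; effective_follow is a hypothetical helper name invented for this example.

```python
def effective_follow(callback=None, follow=None):
    """Mirror the documented default for a Rule's `follow` argument.

    Hypothetical helper for illustration only; it is not part of Scrapy.
    """
    if follow is not None:
        # An explicit value always wins over the default.
        return bool(follow)
    # No explicit value: follow links only when there is no callback.
    return callback is None


# Neither argument given.
print(effective_follow())
# Callback only.
print(effective_follow(callback="parse_item"))
# Both given, as in the rule from the question.
print(effective_follow(callback="parse_item", follow=True))
```

Under that reading of the documentation, the three calls print True, False and True.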

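Note that this only describes the default: when a rule has a callback and no explicit follow, links on the matched pages are not followed. Passing follow=True together with a callback, as in your rule, is valid; the callback runs for each matched page and links are still extracted from it.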

The main idea behind CrawlSpider's rules is that some links are only followed to discover more pages, while others lead to pages that a callback actually parses.

Also, Scrapy isn't the best tool for walking through "local" files; a simple script is enough for that.
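
A minimal sketch of such a script, using only the standard library. It assumes the pages link to each other with relative href values and are UTF-8 encoded; LinkCollector, extract_links and walk_local are names made up for this example.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return all <a href="..."> values found in the markup, in order."""
    collector = LinkCollector()
    collector.feed(markup)
    return collector.links


def walk_local(start_file):
    """Return the set of local files reachable from start_file via relative links."""
    seen, queue = set(), [Path(start_file).resolve()]
    while queue:
        current = queue.pop()
        if current in seen or not current.is_file():
            continue
        seen.add(current)
        markup = current.read_text(encoding="utf-8", errors="replace")
        for href in extract_links(markup):
            target, _fragment = urldefrag(href)  # drop "#anchor" parts
            if not target or "://" in target:    # skip same-page and remote links
                continue
            queue.append((current.parent / unquote(target)).resolve())
    return seen
```

Calling walk_local with the path to bible-toc.xhtml would then return every local page reachable from the table of contents, and each returned path can be opened and processed however you need.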

You are also setting the allowed_domains class attribute, which lists the domains the spider is allowed to request; links to any other domain are rejected. It only makes sense for links on the internet: a file:// URL has no host name, so it cannot match 'holy-bible-eng'. Remove that attribute if you don't want to reject domains, or if you are not using domains at all (your case).
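
To see why a file:// URL cannot match a domain, you can inspect its host name with the standard library. The pattern below is a simplified stand-in for a domain check, written as an assumption for illustration; it is not Scrapy's actual implementation.

```python
import re
from urllib.parse import urlparse

allowed_domains = ["holy-bible-eng"]  # value from the question
start_url = "file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml"

# file:// URLs carry no host name, so hostname is None here.
host = urlparse(start_url).hostname or ""

# Simplified, assumed domain check: the host must equal an allowed domain
# or be a subdomain of one.
pattern = re.compile(
    r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
)
is_allowed = bool(pattern.search(host))

print(repr(host), is_allowed)
```

The host is an empty string, so the check fails even though 'holy-bible-eng' appears in the path of the URL.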
