Reputation: 2202
I'm trying to write my first Scrapy spider. I've been following the tutorial at http://doc.scrapy.org/en/latest/intro/tutorial.html but I'm getting the error "KeyError: 'Spider not found: juno'".
I think I'm running the command from the correct directory (the one with the scrapy.cfg file)
(proscraper)#( 10/14/14@ 2:06pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
tree
.
├── scrapy
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── juno_spider.py
└── scrapy.cfg
2 directories, 7 files
(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
ls
scrapy scrapy.cfg
Here is the error I'm getting:
(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
scrapy crawl juno
/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
  verifyHostname, VerificationError = _selectVerifyImplementation()
Traceback (most recent call last):
  File "/home/tim/.virtualenvs/proscraper/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==0.24.4', 'console_scripts', 'scrapy')()
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 58, in run
    spider = crawler.spiders.create(spname, **opts.spargs)
  File "/home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/spidermanager.py", line 44, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: juno'
This is my virtualenv:
(proscraper)#( 10/14/14@ 2:13pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
pip freeze
Scrapy==0.24.4
Twisted==14.0.2
cffi==0.8.6
cryptography==0.6
cssselect==0.9.1
ipdb==0.8
ipython==2.3.0
lxml==3.4.0
pyOpenSSL==0.14
pycparser==2.10
queuelib==1.2.2
six==1.8.0
w3lib==1.10.0
wsgiref==0.1.2
zope.interface==4.1.1
Here is the code for my spider with the name attribute filled in:
(proscraper)#( 10/14/14@ 2:14pm )( tim@localhost ):~/Workspace/Development/hacks/prosum-scraper/scrapy
cat scrapy/spiders/juno_spider.py
import scrapy


class JunoSpider(scrapy.Spider):
    name = "juno"
    allowed_domains = ["http://www.juno.co.uk/"]
    start_urls = [
        "http://www.juno.co.uk/dj-equipment/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
Upvotes: 7
Views: 10765
Reputation: 13798
When you start a project with scrapy as the project name (i.e. scrapy startproject scrapy), it creates the directory structure you printed:
.
├── scrapy
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── juno_spider.py
└── scrapy.cfg
But using scrapy as the project name has a side effect. If you open the generated scrapy.cfg you will see that its default entry points to your project's scrapy.settings module:
[settings]
default = scrapy.settings
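You can confirm which settings module that entry resolves to by reading scrapy.cfg roughly the same way Scrapy does internally (it parses the file with the standard library's ConfigParser). A minimal sketch, assuming Python 2.7 to match the pip freeze above, with read_cfg.py as a hypothetical name:
# read_cfg.py -- hypothetical diagnostic, run from the project root
from ConfigParser import SafeConfigParser  # Python 2 stdlib module

parser = SafeConfigParser()
parser.read('scrapy.cfg')
# Prints the dotted path of the settings module the project declares
print(parser.get('settings', 'default'))   # -> scrapy.settings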
When we cat that scrapy.settings module (your project's scrapy/settings.py file) we see:
BOT_NAME = 'scrapy'
SPIDER_MODULES = ['scrapy.spiders']
NEWSPIDER_MODULE = 'scrapy.spiders'
Well, nothing strange here: the bot name, the list of modules where Scrapy will look for spiders, and the module in which the genspider command creates new spiders. So far, so good.
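As an aside, spider lookup essentially boils down to "import every module listed in SPIDER_MODULES and index the Spider subclasses by their name attribute". A simplified, hypothetical sketch of that idea, not Scrapy's real code (the actual spider manager also walks submodules, and this assumes the scrapy on sys.path is the library rather than a clashing project package):
# find_spiders.py -- simplified illustration, not Scrapy's internals
import importlib
import inspect

import scrapy

def find_spiders(spider_modules):
    """Map each discovered spider's name attribute to its class."""
    spiders = {}
    for modname in spider_modules:
        mod = importlib.import_module(modname)
        for obj in vars(mod).values():
            if (inspect.isclass(obj)
                    and issubclass(obj, scrapy.Spider)
                    and getattr(obj, 'name', None)):
                spiders[obj.name] = obj
    return spiders

# With SPIDER_MODULES = ['scrapy.spiders'] this would map 'juno' to
# JunoSpider -- provided the right 'scrapy.spiders' gets imported.
print(find_spiders(['scrapy.spiders']))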
Now let's check the scrapy library. It has been properly installed in your isolated proscraper virtualenv, under the /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy directory. Remember that site-packages is always added to sys.path, the list of paths Python searches when importing modules. So, guess what: the scrapy library has a settings module of its own, /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/settings, which imports /home/tim/.virtualenvs/proscraper/lib/python2.7/site-packages/scrapy/settings/default_settings.py, the file that holds the default values for all settings. Pay special attention to its default SPIDER_MODULES entry:
SPIDER_MODULES = []
Maybe you are starting to see what is happening. Choosing scrapy as the project name also generated a scrapy.settings module that clashes with the scrapy library's own scrapy.settings. Here the order in which the corresponding paths were inserted into sys.path determines which one Python imports: the first to appear wins. In this case the library's settings win, its SPIDER_MODULES list is empty, and hence the KeyError: 'Spider not found: juno'.
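You can watch the clash from Python itself. A quick diagnostic (only real stdlib and Scrapy modules involved); note that the winner depends on where the interpreter was started, which is exactly the fragility you want to avoid:
# which_settings.py -- shows which 'scrapy.settings' Python resolves
import sys
import scrapy.settings

# A path into site-packages means the library module won; a path into
# your project means your scrapy/settings.py shadowed the library.
print(scrapy.settings.__file__)
print(sys.path[:3])   # the front of the search path decides the winner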
To solve this conflict you could rename your project folder to another name, let's say scrap:
.
├── scrap
│   ├── __init__.py
Modify your scrapy.cfg to point to the proper settings module:
[settings]
default = scrap.settings
And update your scrap.settings to point to the proper spiders module:
SPIDER_MODULES = ['scrap.spiders']
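After those changes a quick sanity check is scrapy list, which prints the name of every spider Scrapy can discover; once the clash is gone you should see:
scrapy list
juno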
But, as @paultrmbrth suggested, I would simply recreate the project with another name.
Upvotes: 10