Robert Mark Bram

Reputation: 9725

Cannot have two spiders in the same project?

I was able to generate the first spider without any problem:

Thu Feb 27 - 01:59 PM > scrapy genspider confluenceChildPages confluence
Created spider 'confluenceChildPages' using template 'crawl' in module:
  dirbot.spiders.confluenceChildPages

But when I attempted to generate another spider, I got this:

Thu Feb 27 - 01:59 PM > scrapy genspider xxx confluence
Traceback (most recent call last):
  File "/usr/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.22.2', 'scrapy')
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 505, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/site-packages/pkg_resources.py", line 1245, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/commands/genspider.py", line 68, in run
    crawler = self.crawler_process.create_crawler()
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/crawler.py", line 87, in create_crawler
    self.crawlers[name] = Crawler(self.settings)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/crawler.py", line 25, in __init__
    self.spiders = spman_cls.from_crawler(self)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler
    sm = cls.from_settings(crawler.settings)
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings
    return cls(settings.getlist('SPIDER_MODULES'))
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__
    for module in walk_modules(name):
  File "/usr/lib/python2.7/site-packages/Scrapy-0.22.2-py2.7.egg/scrapy/utils/misc.py", line 68, in walk_modules
    submod = import_module(fullpath)
  File "/usr/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/d/Work/TollOnline/Notes/Issues/JIRA/TOL-821_Review_Toll_Online_Confluence_Pages/dirbot-master/dirbot/spiders/confluenceChildPages.py", line 4, in <module>
    from scrapybot.items import ScrapybotItem
ImportError: No module named scrapybot.items

Update: Thursday 27 February 2014, 07:35:24 PM - to add information @omair_77 was asking for.

I am using dirbot from https://github.com/scrapy/dirbot.

Initial directory structure is:

.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/spiders
./dirbot/spiders/dmoz.py
./dirbot/spiders/__init__.py
./dirbot/__init__.py
./README.rst
./scrapy.cfg
./setup.py

I then try to create two spiders:

scrapy genspider confluenceChildPagesWithTags confluence
scrapy genspider confluenceChildPages confluence

and I get the error on the second genspider command.


Update: Wednesday 5 March 2014, 02:16:07 PM - to add information in relation to @Darian's answer. The session below shows that scrapybot appears in the project only after the first genspider command.

Wed Mar 05 - 02:12 PM > find .
.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/spiders
./dirbot/spiders/dmoz.py
./dirbot/spiders/__init__.py
./dirbot/__init__.py
./README.rst
./scrapy.cfg
./setup.py
Wed Mar 05 - 02:13 PM > find . -type f -print0 | xargs -0 grep -i scrapybot
Wed Mar 05 - 02:14 PM > scrapy genspider confluenceChildPages confluence
Created spider 'confluenceChildPages' using template 'crawl' in module:
  dirbot.spiders.confluenceChildPages
Wed Mar 05 - 02:14 PM > find .
.
./.gitignore
./dirbot
./dirbot/items.py
./dirbot/items.pyc
./dirbot/pipelines.py
./dirbot/settings.py
./dirbot/settings.pyc
./dirbot/spiders
./dirbot/spiders/confluenceChildPages.py
./dirbot/spiders/dmoz.py
./dirbot/spiders/dmoz.pyc
./dirbot/spiders/__init__.py
./dirbot/spiders/__init__.pyc
./dirbot/__init__.py
./dirbot/__init__.pyc
./README.rst
./scrapy.cfg
./setup.py
Wed Mar 05 - 02:17 PM > find . -type f -print0 | xargs -0 grep -i scrapybot
./dirbot/spiders/confluenceChildPages.py:from scrapybot.items import ScrapybotItem
./dirbot/spiders/confluenceChildPages.py:        i = ScrapybotItem()

and that newly generated confluenceChildPages.py is:

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapybot.items import ScrapybotItem

class ConfluencechildpagesSpider(CrawlSpider):
    name = 'confluenceChildPages'
    allowed_domains = ['confluence']
    start_urls = ['http://www.confluence/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        sel = Selector(response)
        i = ScrapybotItem()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i

So I can see that it references scrapybot, but I am not sure how to fix it. I am still very much a beginner with Scrapy.

Upvotes: 0

Views: 535

Answers (2)

Darian Moody

Reputation: 4084

You see this last line in the traceback:

  File "/d/Work/TollOnline/Notes/Issues/JIRA/TOL-821_Review_Toll_Online_Confluence_Pages/dirbot-master/dirbot/spiders/confluenceChildPages.py", line 4, in <module>
    from scrapybot.items import ScrapybotItem

This tells me that the first spider you generated, confluenceChildPages, tries to import its items from a module called scrapybot, which does not exist in your project. If you open confluenceChildPages.py, you will see the line that causes the error.

Off the top of my head, I'm not sure which setting the generator takes that name from, but if you search (grep) for scrapybot within your project you should find where it comes from. You can then change it to dirbot, which looks like the module you want.
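If grep is not handy, the same search can be sketched in Python. This is only an illustration: the function name find_references is made up for this example, and it looks at .py files only, whereas the grep in the question searched every file. Note that in the question's own output the only hits were inside the generated spider itself.

```python
import os


def find_references(root, needle):
    """Return (path, line_number, line) for each .py line under root that mentions needle.

    Hypothetical helper for this answer, not part of Scrapy.
    The match is case-insensitive, like `grep -i`.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(".py"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if needle.lower() in line.lower():
                        hits.append((path, number, line.rstrip("\n")))
    return hits
```

Calling find_references(".", "scrapybot") from the project root and printing each tuple should point at the import line and the ScrapybotItem() line in confluenceChildPages.py.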

You will then need to delete the first generated spider and re-generate it. The error appears the second time you create a spider because Scrapy loads every existing spider in the project, including the first one you generated, and since that one contains a broken import, you get the traceback.
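To see why one bad file breaks an unrelated command, here is a rough sketch of the mechanism. It is not Scrapy's code (the traceback shows that job being done by walk_modules in scrapy/utils/misc.py); it just uses the standard library to import every module in a package. The names demo_spiders, good.py, broken.py and no_such_project_xyz are invented for the demo.

```python
import importlib
import os
import pkgutil
import sys
import tempfile


def import_all_submodules(package_name):
    """Import a package and every module inside it, as a spider loader might."""
    package = importlib.import_module(package_name)
    modules = [package]
    for info in pkgutil.walk_packages(package.__path__, package.__name__ + "."):
        modules.append(importlib.import_module(info.name))
    return modules


# Build a throwaway package: one healthy module, one with a bad import.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, "demo_spiders")
os.mkdir(pkg_dir)
files = {
    "__init__.py": "",
    "good.py": "NAME = 'good'\n",
    "broken.py": "from no_such_project_xyz.items import Thing\n",
}
for filename, source in files.items():
    with open(os.path.join(pkg_dir, filename), "w") as handle:
        handle.write(source)

sys.path.insert(0, root)
importlib.invalidate_caches()

try:
    import_all_submodules("demo_spiders")
    outcome = "all modules imported"
except ImportError as exc:
    # The single broken module is enough to make the whole scan fail.
    outcome = "import failed: %s" % exc
print(outcome)
```

The scan fails even though good.py is fine, which is the same situation as running genspider a second time with a broken confluenceChildPages.py already sitting in dirbot/spiders.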

Cheers.

Upvotes: 1

Omair Shamshir

Reputation: 2136

Please show your directory hierarchy so we can give a better answer. This problem mostly occurs when your spider module has the same name as your Scrapy project module, so Python tries to import items relative to the spider. Make sure your project module and your spider module do not share a name.

Upvotes: 1
