Arctelix

Reputation: 4576

How to get Celery working with the Scrapy server on Heroku and django-dynamic-scraper?

I am in the process of building my first project incorporating Scrapy. Everything works well on my development server (Windows), but I have a few issues on Heroku. I am using django-dynamic-scraper, which handled a lot of the integration work for me.

On Windows I run the following commands in separate command prompts:

    scrapy server
    python manage.py celeryd -l info
    python manage.py celerybeat

On heroku I run the following:

    heroku bash > heroku run scrapy server    (solves the "app not found" issue)
    heroku run python manage.py celeryd -l info -B --settings=myapp.production

The actual Django app has no errors or issues and I can access the admin site. scrapy server runs:

    Scrapyd web console available at http://0.0.0.0:6800/
    [Launcher] Scrapyd started: max_proc=16, runner='scrapyd.runner'
    Site starting on 6800
    Starting factory <twisted.web.server.Site instance at 0x7f1511f62ab8>

and Celery beat and the worker are running:

    [INFO/Beat] beat: Starting...
    [INFO/Beat] Writing entries...
    [INFO/MainProcess] Connected to django://guest:**@localhost:5672//
    [WARNING/MainProcess] celery@081b4100-eb7f-441c-976d-ecf97d2d7e5a ready.
    [INFO/Beat] Writing entries...
    [INFO/Beat] Writing entries...

FIRST ISSUE: When the periodic task to run the spider is triggered, I get the following error in the celery log:

    File "/app/.heroku/python/lib/python2.7/site-packages/dynamic_scraper/utils/ta
    sk_utils.py", line 31, in _pending_jobs
        resp = urllib2.urlopen('http://localhost:6800/listjobs.json?project=default')
    ...
    ...

    File "/app/.heroku/python/lib/python2.7/urllib2.py", line 1184, in do_open
        raise URLError(err)
    URLError: <urlopen error [Errno 111] Connection refused>

So it seems that, for some reason, Heroku is not letting Celery reach the Scrapy server.
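One way to confirm the isolation (a diagnostic sketch, not from the original post, assuming curl is available in the dyno): query the Scrapyd API from a one-off dyno. Every `heroku run` command starts its own dyno, so Scrapyd listening in another dyno is not on this localhost:

    heroku run curl "http://localhost:6800/listjobs.json?project=default"
    # expect a connection-refused error here, mirroring the URLError above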

Here are some of my settings:

scrapy.cfg

[settings]
default = myapp.scraper.scrape.settings

[deploy]
#url = http://localhost:6800/
project = myapp

celery config

[config]
    app:         default:0x7fd4983f6310 (djcelery.loaders.DjangoLoader)
    transport:   django://guest:**@localhost:5672//
    results:     database
    concurrency: 4 (prefork)
[queues]
    celery       exchange=celery(direct) key=celery

Thanks in advance and let me know if you need any more info.

Upvotes: 1

Views: 963

Answers (1)

Arctelix

Reputation: 4576

The answer is: you can't run your web app, Celery, and the Scrapy server as separate processes on Heroku and have them talk to each other over localhost. Each process (including every `heroku run` one-off) lands in its own isolated dyno with its own localhost, which is why the connection above is refused. However, there are two ways to accomplish this setup on Heroku.

Option 1:

  1. Use scrapy-heroku to deploy your scrapy server to a host called "myapp-scrapy.herokuapp.com".
  2. Then deploy your django-scrapy app to another host called "myapp.herokuapp.com".
  3. In django-dynamic-scraper, open task_utils.py and change all occurrences of localhost:6800 to myapp-scrapy.herokuapp.com (see the sketch after this list).
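For illustration, here is roughly what step 3 amounts to, based on the `_pending_jobs` line visible in the traceback above. The `SCRAPYD_URL` constant, the function signature, and the return value are assumptions made for this sketch, not django-dynamic-scraper's actual code:

    # task_utils.py (sketch, Python 2 to match the traceback above)
    import json
    import urllib2

    # Was 'http://localhost:6800' -- point at the dedicated Scrapyd app instead.
    # SCRAPYD_URL is a name introduced for this sketch; DDS hardcodes the URL inline.
    SCRAPYD_URL = 'http://myapp-scrapy.herokuapp.com'

    def _pending_jobs(project='default'):
        # Ask Scrapyd's listjobs.json endpoint which jobs exist for this project.
        resp = urllib2.urlopen(SCRAPYD_URL + '/listjobs.json?project=' + project)
        data = json.loads(resp.read())
        # Scrapyd buckets jobs into 'pending', 'running' and 'finished'.
        return data.get('pending', [])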

Option 2:

  1. Simply use Heroku's Scheduler add-on to call your scrapers directly, as you would on the command line. You will be bypassing all of the dynamic scheduling features, but for some use cases that's just fine (an example job follows).
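As a sketch, the Scheduler job is just the command you already run locally; `my_spider` is a placeholder name, and DDS spiders may take additional `-a` arguments, so mirror whatever you use in development:

    scrapy crawl my_spider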

I hope this helps somebody save some pain.

Upvotes: 1
