Reputation: 745
I have been trying to make an app in Python using Scrapy
that has the following functionality:
I am able to do this using the following code:
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

items = []

def add_item(item):
    items.append(item)

# set up crawler
crawler = Crawler(SpiderClass, settings=get_project_settings())
crawler.signals.connect(add_item, signal=signals.item_passed)
# Stop the reactor once the spider closes; without this, the code hangs at the reactor.run() line.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)  #@UndefinedVariable
crawler.crawl(requestParams=requestParams)
# start crawling (blocks until the reactor is stopped)
reactor.run()  #@UndefinedVariable
return str(items)
The problem is that after the reactor stops (which seems necessary, since I don't want execution to stay stuck at reactor.run()), the app cannot handle any further requests. After the first request completes, the next one fails with the following error:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\flask\app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\python27\lib\site-packages\flask\app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\python27\lib\site-packages\flask\app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "F:\my_workspace\jobvite\jobvite\com\jobvite\web\RequestListener.py", line 38, in submitForm
    reactor.run() #@UndefinedVariable
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable
This is expected, since a Twisted reactor cannot be restarted once it has been stopped.
So my questions are:
1) How can I support crawling for subsequent requests?
2) Is there any way to move on to the next line after reactor.run() without stopping the reactor?
Upvotes: 8
Views: 1003
Reputation: 346
Here is a simple solution to your problem:
from flask import Flask
import threading
import subprocess
import sys

app = Flask(__name__)

class myThread(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target = target  # label only; run() always calls start_crawl()

    def run(self):
        start_crawl()

def start_crawl():
    # Popen returns immediately; the crawler runs in its own process with its own reactor
    proc = subprocess.Popen([sys.executable, "start_request.py"])
    return

@app.route("/crawler/start")
def start_req():
    print(":request")
    threadObj = myThread("run_crawler")
    threadObj.start()
    return "Your crawler is in running state"

if __name__ == "__main__":
    app.run(port=5000)
The solution above assumes that you can start your crawler from the command line by running the start_request.py file.
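If the request has to send the scraped items back (as the question's return str(items) does), one option is to wait for the child process and read what it prints. This is only a minimal sketch for Python 3, and it rests on an assumption that is not in the answer above: that the crawl script prints a single JSON document to stdout and nothing else. The name crawl_and_collect is made up for this example.

```python
import json
import subprocess
import sys


def crawl_and_collect(script_args, timeout=600):
    """Run a Python script in a fresh interpreter and return its JSON output.

    Assumption: the script prints one JSON document (for example a list of
    items) to stdout and nothing else.
    """
    # A new process gets a new Twisted reactor, so every call can run a crawl.
    raw = subprocess.check_output(
        [sys.executable] + list(script_args), timeout=timeout
    )
    return json.loads(raw.decode("utf-8"))
```

Something like crawl_and_collect(["start_request.py"]) would then block the request until the crawl finishes, which trades the "fire and forget" behaviour for a usable return value.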
What we are doing here is using Python's threading module to launch a new thread for each incoming request, so you can easily run a crawler instance in parallel for each hit. Just keep the number of threads under control using threading.activeCount().
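One way the thread-count check could look is sketched below (Python 3, where the function is spelled threading.active_count()). Note that in the code above each thread ends as soon as Popen returns, so the count would not reflect running crawlers; this sketch instead has the thread wait for the child with subprocess.call. The helper name start_crawl_thread and the max_threads parameter are hypothetical, and the count also includes the main thread and any server threads.

```python
import subprocess
import sys
import threading

_start_lock = threading.Lock()


def start_crawl_thread(cmd, max_threads):
    """Run cmd in a worker thread unless max_threads threads are already alive.

    Returns the started Thread, or None when the limit has been reached.
    """
    with _start_lock:  # make the check-and-start step atomic
        if threading.active_count() >= max_threads:
            return None
        # subprocess.call blocks until the child exits, so the thread stays
        # alive (and is counted) for as long as the crawler is running.
        worker = threading.Thread(target=subprocess.call, args=(cmd,))
        worker.start()
        return worker
```

In the Flask view you could call start_crawl_thread([sys.executable, "start_request.py"], max_threads=10) and return a "busy" message when it gives back None.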
Upvotes: 1
Reputation: 9246
I recommend using a queue system such as RQ (chosen here for simplicity, but there are several others).
You could have a crawl function:
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from spiders import MySpider

def runCrawler(url, keys, mode, outside, uniqueid):
    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl(MySpider, url=url, param1=value1, ...)
    # stop the reactor whether the crawl succeeds or fails
    d.addBoth(lambda _: reactor.stop())
    # blocks until reactor.stop() is called
    reactor.run()
Then in your main code, use the RQ queue to schedule crawler executions:
# other imports
import redis
from rq import Queue

pool = redis.ConnectionPool(host=REDIS_HOST, port=REDIS_PORT, db=your_redis_db_number)
redis_conn = redis.Redis(connection_pool=pool)
q = Queue('parse', connection=redis_conn)

# urlSet is a list of http:// or https:// URLs
for url in urlSet:
    # timeout is in seconds; newer RQ releases call this argument job_timeout
    job = q.enqueue(runCrawler, url, param1, ... , timeout=600)
Do not forget to start an RQ worker process that listens on the same queue name (here, parse). For example, run this in a terminal session:
rq worker parse
Upvotes: 1