Reputation: 37
We are trying to program a bot that crawls articles from a newspaper via its RSS feeds. We want our script to repeat these steps several times per day:
1) look at the RSS feeds we have listed
2) identify the articles we haven't crawled yet
3) add their links to a list of URLs to crawl
4) crawl the listed URLs
We managed to execute these steps once with this code:
import sqlite3

import feedparser
from scrapy.crawler import CrawlerProcess

rss_feeds_lemonde = [
    'http://www.lemonde.fr/rss/une.xml',
    'http://www.lemonde.fr/international/rss_full.xml',
    'http://www.lemonde.fr/politique/rss_full.xml',
]

db = sqlite3.connect('newspaper_db')
cursor = db.cursor()
urls = []
already_met = False
site = 'lemonde'

for rss_feed in rss_feeds_lemonde:
    parsed_rss_feed = feedparser.parse(rss_feed)
    for post in parsed_rss_feed.entries:
        url = post.link
        if url.split('.')[1] == site:
            # parameterized query (a bare "site" inside the SQL string
            # would raise "no such column: site")
            cursor.execute('''SELECT url FROM articles WHERE newspaper = ?''', (site,))
            rows = cursor.fetchall()
            for row in rows:
                if row[0] == url:
                    already_met = True
            if not already_met:
                cursor.execute('''INSERT INTO articles(url, newspaper) VALUES(?,?)''', (url, site))
                urls.append(url)
            else:
                already_met = False

cursor.close()
db.commit()
db.close()

if urls != []:
    process = CrawlerProcess()
    process.crawl(LeMondeSpider, start_urls=urls)  # LeMondeSpider is defined elsewhere
    process.start()
The problem is that the Twisted reactor is not restartable, so we can only run these steps once. Is it possible to pause the reactor and resume it after we provide a new list of URLs to crawl? Are there other solutions?
[edit] For notorious.no: this example now works fine, thanks to you!
import time

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor


def run_when_crawl_done(null):
    time.sleep(10)  # note: this blocks the reactor for 10 seconds
    urls = [
        'http://www.lefigaro.fr/elections/presidentielles/2017/05/05/35003-20170505ARTFIG00129-comment-ils-veulent-bloquer-le-pen-sans-soutenir-macron-ce-dimanche.php',
        'http://www.lefigaro.fr/elections/presidentielles/2017/05/04/35003-20170504ARTFIG00259-si-marine-le-pen-atteint-40-ca-serait-deja-une-enorme-victoire-dit-sa-niece.php',
        'http://www.lefigaro.fr/elections/presidentielles/2017/05/04/35003-20170504ARTFIG00126-emmanuel-macron-non-je-n-ai-pas-de-compte-aux-bahamas.php',
    ]
    # runner is the module-level CrawlerRunner created below
    deferred = runner.crawl(LeFigaroSpider, start_urls=urls)
    deferred.addCallback(lambda _: reactor.stop())


urls = [
    'http://www.lemonde.fr/les-decodeurs/article/2017/04/26/europe-macron-emploi-la-trumpisation-de-marine-le-pen-sur-tf1_5117479_4355770.html',
    'http://www.lemonde.fr/syrie/article/2017/04/26/attaque-chimique-la-france-avance-ses-preuves-contre-damas_5117652_1618247.html',
]

if urls != []:
    configure_logging()
    runner = CrawlerRunner()
    deferred = runner.crawl(LeMondeSpider, start_urls=urls)
    deferred.addCallback(run_when_crawl_done)
    reactor.run()
Upvotes: 2
Views: 1722
Reputation: 21436
If you really want to have a Python loop running and acting as a crawling scheduler (which is generally not a very good idea), you should use the subprocess
module to spawn a crawling process; each run then gets a fresh process, and with it a fresh reactor:
import subprocess
import time

while True:
    # each "scrapy crawl" runs in its own process with its own reactor
    subprocess.run(['scrapy', 'crawl', 'spider'], cwd='project')
    time.sleep(60 * 30)  # wait 30 minutes between runs
All of your SQL logic should go in the spider itself rather than in the execution script.
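For example, the RSS lookup and dedup could live in the spider's start_requests(). A rough sketch, assuming the articles table from your question and a parse() callback that does the actual extraction:
import sqlite3

import feedparser
import scrapy


class LeMondeSpider(scrapy.Spider):
    name = 'lemonde'
    rss_feeds = [
        'http://www.lemonde.fr/rss/une.xml',
        'http://www.lemonde.fr/international/rss_full.xml',
        'http://www.lemonde.fr/politique/rss_full.xml',
    ]

    def start_requests(self):
        db = sqlite3.connect('newspaper_db')
        cursor = db.cursor()
        # urls already crawled for this newspaper
        cursor.execute('SELECT url FROM articles WHERE newspaper = ?', (self.name,))
        seen = {row[0] for row in cursor.fetchall()}
        new_urls = []
        for rss_feed in self.rss_feeds:
            for post in feedparser.parse(rss_feed).entries:
                if post.link not in seen:
                    cursor.execute('INSERT INTO articles(url, newspaper) VALUES(?,?)',
                                   (post.link, self.name))
                    new_urls.append(post.link)
        db.commit()
        db.close()
        for url in new_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # article extraction logic goes here
        pass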
Upvotes: 1
Reputation: 5107
Twisted's reactor is indeed not restartable. If you think about it for a minute, you will realize that stopping an event loop, only to have another event start it back up, is counterintuitive. Most event-driven apps are "long-running" and should not stop unless something is severely wrong.
Do not start-stop-restart event loops. Start the app and then never restart it (you're making a bot, so I assume the bot never sleeps). Use CrawlerRunner instead of CrawlerProcess, then execute reactor.run(). This gives you a bit more flexibility and lets you run more tasks concurrently.
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor


def run_when_crawl_done(null):
    """
    logic that will be executed after the crawl is done
    """


if urls:
    runner = CrawlerRunner()
    deferred = runner.crawl(LeMondeSpider, start_urls=urls)
    deferred.addCallback(run_when_crawl_done)
    reactor.run()
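Since you want to repeat the crawl several times per day, the crawls can also be chained indefinitely with inlineCallbacks, so the reactor never stops. A minimal sketch under a few assumptions: get_new_urls() is a hypothetical helper standing in for steps 1-3 of your question, and the 30-minute pause between passes is arbitrary:
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import defer, reactor, task


@defer.inlineCallbacks
def crawl_forever(runner):
    while True:
        urls = get_new_urls()  # hypothetical helper: steps 1-3 of the question
        if urls:
            # runner.crawl() returns a Deferred; yielding it waits for
            # the crawl to finish without blocking the reactor
            yield runner.crawl(LeMondeSpider, start_urls=urls)
        # non-blocking pause before the next pass (unlike time.sleep())
        yield task.deferLater(reactor, 60 * 30, lambda: None)


configure_logging()
runner = CrawlerRunner()
crawl_forever(runner)
reactor.run()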
Upvotes: 2