Reputation: 18735
I have a Scrapy spider which I run every hour using a bash script and crontab.
The running time of the spider is about 50 minutes, but it can be more than an hour.
What I want is to check whether the spider is still running, and start a new crawl only if it is not.
BASH SCRIPT
#!/usr/bin/env bash
source /home/milano/.virtualenvs/keywords_search/bin/activate
cd /home/milano/PycharmProjects/keywords_search/bot
# HERE I WANT TO CHECK, WHETHER THE PREVIOUS CRAWLING ALREADY STOPPED, IF NOT, DO NOTHING
scrapy crawl main_spider
The only thing that comes to my mind is to use telnet.
If telnet localhost 6023 can connect, it means the spider is still running; otherwise I can start the spider.
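A rough sketch of that idea, assuming the telnet console is enabled on its default port 6023 (this uses bash's built-in /dev/tcp instead of the telnet binary, so the check is just a connection attempt):

# if the telnet console port is open, a crawl is still running
if (exec 3<>/dev/tcp/localhost/6023) 2>/dev/null; then
    exit 0
fi
scrapy crawl main_spider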
Upvotes: 3
Views: 1108
Reputation: 4712
You need some sort of locking mechanism.
The best way to achieve an atomic lock from bash is to use mkdir and check the result code to know if you acquired the lock or not.
Here's a more in-depth explanation: http://wiki.bash-hackers.org/howto/mutex
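As a minimal sketch of how that could look in your cron script (the lock directory path is just an example):

#!/usr/bin/env bash
lockdir=/tmp/main_spider.lock

# mkdir is atomic: only one process can create the directory at a time
if ! mkdir "$lockdir" 2>/dev/null; then
    echo "Previous crawl still running, exiting." >&2
    exit 0
fi
# release the lock on exit, even if the crawl fails
trap 'rmdir "$lockdir"' EXIT

source /home/milano/.virtualenvs/keywords_search/bin/activate
cd /home/milano/PycharmProjects/keywords_search/bot
scrapy crawl main_spider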
Of course you could always go for dirtier methods like a grep on process names or stuff like that.
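Something along these lines, for instance (pgrep -f matches against the full command line; the pattern here is just an example):

# skip this run if a matching crawl process is already alive
if pgrep -f "scrapy crawl main_spider" > /dev/null; then
    exit 0
fi
scrapy crawl main_spider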
You could also implement the lock in Scrapy itself, e.g. a simple middleware that checks a shared resource... Plenty of ways to do it :)
Upvotes: 1