Yura

Reputation: 2457

how to write endless loop crawler in python?

EDITED:

I have a crawler.py that crawls certain sites every 10 minutes and sends me emails about those sites. The crawler is ready and working locally.

How can I adjust it so that the following two things happen:

  1. It will run in an endless loop on the hosting I'll upload it to.
  2. Sometimes I will be able to stop it (e.g. for debugging).

At first, I thought of using an endless loop, e.g.

crawler.py:

import time

while True:
    doCrawling()     # crawl the sites and send the emails
    time.sleep(600)  # wait 10 minutes

However, according to the answers I got below, this is impossible since hosting providers kill processes after a while (for the sake of the question, let's assume processes are killed every 30 minutes). Therefore, my endless-loop process would be killed at some point.

Therefore, I have thought of a different solution. Let's assume that my crawler is located at "www.example.com/crawler.py" and each time it is accessed, it executes the function run():

import time
import requests

def run():
    doCrawling()
    time.sleep(600)  # wait 10 minutes
    requests.get('http://www.example.com/crawler.py')  # start the next run

Thus, there is no endless loop: every time my crawler runs, it also requests the URL, which executes the same crawler again. There is no single long-running process, yet the crawler keeps operating forever.

Will my idea work? Are there any hidden drawbacks I haven't thought of?

Thanks!

Upvotes: 0

Views: 899

Answers (3)

AkiRoss

Reputation: 12273

As you stated in the comments, you are running on a public shared host like GoDaddy. cron is not available there, and long-running scripts are usually forbidden: your process would be killed even if it were just sleeping.

Therefore, the only solution I see is to use an external server that you control to connect to your public server and run the script every 10 minutes. One option is a cron job on your local machine that requests a specific page on your host with wget or curl. **

Maybe there are online services that run a script periodically, and you could use one of those, but I don't know of any.

** Bonus: you can get the results directly in the response, without having to send yourself an email.
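
For instance, a minimal sketch of such a trigger script, assuming Python with the requests library on your local machine (the URL is the question's placeholder, and the crontab timing in the comment is just an example):

# trigger.py - run from your local machine's cron every 10 minutes,
# e.g. with a crontab entry like: */10 * * * * python /path/to/trigger.py
import requests

# fetching the page is assumed to run the crawler on the host
response = requests.get('http://www.example.com/crawler.py', timeout=120)

# bonus: the crawl results come back in the response body,
# so there is no need to email them to yourself
print(response.text)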

Update

So, in your updated question you propose to have the script call itself with an HTTP request. I thought of it before, but I didn't consider it in my previous answer because I believed it wouldn't work (in general).

My concern was: will the server kill a script if the HTTP connection requesting it is closed before the script terminates?

In other words: if you open yoursite.com/script.py and it takes 60 seconds to run, but you close the connection after 10 seconds, will the script run to its regular end?

I thought the answer was obviously "no, the script will be killed", which would make the method useless: you would have to guarantee that a script calling itself via an HTTP request stays alive longer than the script it calls. A little experiment using flask proved me wrong:

import time

from flask import Flask

app = Flask(__name__)

@app.route('/')
def hello_world():
    # takes 10 seconds overall, printing progress along the way
    print('Script started...')
    time.sleep(5)
    print('5 seconds passed...')
    time.sleep(5)
    print('Script finished')
    return 'Script finished'

if __name__ == '__main__':
    app.run()

If I run this script, make an HTTP request to localhost:5000, and close the connection after 2 seconds, the script continues to run until the end and the messages are still printed.

Therefore, with flask, if you can make an asynchronous request to yourself, you should be able to have an "infinite loop" script.

I don't know the behavior on other servers, though. You should test it yourself.
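
To illustrate, here is a minimal sketch of that idea, assuming the host behaves like the flask experiment above; the route name, SELF_URL and the doCrawling() stub are placeholders, and the 10-minute sleep only works if the host's time allowance permits it (see the Control section below):

import threading
import time

import requests
from flask import Flask

app = Flask(__name__)
SELF_URL = 'http://www.example.com/crawler'  # hypothetical public URL of this route

def doCrawling():
    # placeholder for the asker's existing crawling and emailing code
    print('Crawling...')

def trigger_next_run():
    time.sleep(600)  # wait 10 minutes (only viable if the host allows it)
    try:
        # fire-and-forget: a short timeout is enough to start the next run,
        # since the server keeps executing after the connection closes
        requests.get(SELF_URL, timeout=5)
    except requests.exceptions.RequestException:
        pass  # a timeout here is expected; the next run has already started

@app.route('/crawler')
def crawler():
    doCrawling()
    threading.Thread(target=trigger_next_run).start()  # schedule the next call
    return 'Crawl finished, next run scheduled'

if __name__ == '__main__':
    app.run()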

Control

Assuming your server allows a GET-triggered script to keep running after the connection is closed, there are a few things to take care of. Your script still has to finish within the server's maximum time allowance; and if, say, the allowance is 1 minute but you want to crawl every 10 minutes, you have to count calls and only crawl on every 10th one.

In addition, this mechanism has to be controllable, because you cannot interrupt it for debugging as you requested. At least, not directly.

Therefore, I suggest using files:

  1. Use a file to split the crawling into smaller steps, each able to finish in under a minute, and continue from where you left off when the script is called again.

  2. Use a file to count how many times the script has been called before actually crawling. This is necessary if, for example, the script is allowed to live 90 seconds but you want to crawl every 10 hours.

  3. Use a file to control the script: store a boolean flag that you use to stop the recursion mechanism when you need to (see the sketch below).
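
A minimal sketch of the file-based counting and control, assuming every HTTP call ends up in a function like handle_call() and reusing doCrawling() and trigger_next_run() from the sketch above (the file names and the 10-call interval are illustrative):

import os

COUNTER_FILE = 'call_count.txt'
STOP_FILE = 'stop.flag'   # create this file (e.g. over FTP) to stop the loop
CALLS_PER_CRAWL = 10      # with ~1-minute calls, crawl on every 10th call

def should_crawl():
    # increment a persistent call counter; report whether it is time to crawl
    count = 0
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            count = int(f.read() or 0)
    count += 1
    with open(COUNTER_FILE, 'w') as f:
        f.write(str(count))
    return count % CALLS_PER_CRAWL == 0

def handle_call():
    if os.path.exists(STOP_FILE):
        return  # stop requested: break the chain by not re-triggering
    if should_crawl():
        doCrawling()
    trigger_next_run()

To resume after debugging, delete stop.flag and request the URL once by hand.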

Upvotes: 2

Maresh

Reputation: 4712

If you are running Linux, I would set up an Upstart script (http://upstart.ubuntu.com/getting-started.html) to turn it into a service. It offers a lot of advantages, like:

  - starting at system boot
  - automatic restart on crashes
  - manageability: service mycrawler restart, ...

Or, if you would prefer to have it run every 10 minutes, forget about the endless loop and set up a cron job: http://en.wikipedia.org/wiki/Cron

Upvotes: 1

FBidu

Reputation: 1012

If you're using Linux, you should just set up a cron job for your script. Info: http://code.tutsplus.com/tutorials/scheduling-tasks-with-cron-jobs--net-8800

Upvotes: 1
