Reputation: 727
I've written a couple of Twitter scrapers in Python, and am writing another script to keep them running even if they suffer a timeout, disconnection, etc.
My current solution is as follows:
Each scraper file has a doScrape/1 function in it, which starts up a scraper and runs it once, e.g.:
    def doScrape(logger):
        try:
            with DBWriter(logger=logger) as db:
                logger.log_info("starting", __name__)
                s = PastScraper(db.getKeywords(), TwitterAuth(), db, logger)
                s.run()
        finally:
            logger.log_info("Done", __name__)
Here, run is an effectively infinite loop that exits only if an exception is raised.
In order to run one of each kind of scraper at once, I'm using this code (with a few extra imports):
    from threading import Thread

    class ScraperThread(Thread):
        def __init__(self, module, logger):
            super(ScraperThread, self).__init__()
            self.module = module  # Module should contain a doScrape(logger) function
            self.logger = logger

        def run(self):
            while True:
                try:
                    print "Starting!"
                    print self.module.doScrape
                    self.module.doScrape(self.logger)
                except:  # if for any reason we get disconnected, reconnect
                    self.logger.log_debug("Restarting scraper", __name__)

    if __name__ == "__main__":
        with Logger(level="all", handle=open(sys.argv[1], "a")) as l:
            past = ScraperThread(PastScraper, l)
            stream = ScraperThread(StreamScraper, l)
            past.start()
            stream.start()
            past.join()
            stream.join()
However, the call to doScrape in the code above appears to return immediately: "Starting!" is printed to the console repeatedly, and the "Done" message from the finally block is never written to the log. When I run a scraper on its own, like so:
    if __name__ == "__main__":
        # Example instantiation
        from Scrapers.Logging import Logger
        with Logger(level="all", handle=open(sys.argv[1], "a")) as l:
            doScrape(l)
the code runs forever, as expected. I'm a bit stumped.
Is there anything silly that I might have missed?
Upvotes: 1
Views: 798
Reputation: 727
Aha, solved it! I hadn't realised that a default argument (here in TwitterAuth()) is evaluated at definition time. TwitterAuth reads the API key settings from a file handle, and the default argument opens the default config file. Since this file handle is created at definition time, both threads shared the same handle; once one thread had read it, the other tried to read from the end of the file, which threw an exception. I fixed this by resetting the file before use and guarding it with a mutex.
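For anyone who hits the same thing, here is a minimal sketch of the definition-time behaviour, written for Python 3. It is not the real TwitterAuth: `open_config`, `SharedHandleAuth`, `FreshHandleAuth` and the config text are hypothetical stand-ins, and an in-memory `io.StringIO` plays the role of the config file. In this sketch the second read just returns an empty string, where the real code went on to raise.

```python
import io

CONFIG_TEXT = "api_key=hypothetical-key\n"  # made-up config contents


def open_config():
    # Hypothetical stand-in for opening the default config file.
    return io.StringIO(CONFIG_TEXT)


class SharedHandleAuth(object):
    # The default value is evaluated once, when the def statement runs,
    # so every instance created without an argument shares one handle.
    def __init__(self, handle=open_config()):
        self.settings = handle.read()


class FreshHandleAuth(object):
    # None acts as a sentinel; a new handle is opened on each call.
    def __init__(self, handle=None):
        if handle is None:
            handle = open_config()
        self.settings = handle.read()


first = SharedHandleAuth()
second = SharedHandleAuth()   # reads from the already exhausted handle
fresh_a = FreshHandleAuth()
fresh_b = FreshHandleAuth()
```

With the shared default, `first.settings` holds the config text and `second.settings` is empty. With the `None` sentinel, both instances see the full text, which avoids the shared handle altogether.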
Cheers to Irmen de Jong for pointing me in the right direction.
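The reset-plus-mutex fix looks roughly like this. Again this is only a sketch for Python 3 under assumptions: `shared_handle`, `handle_lock` and `read_settings` are hypothetical names, and an `io.StringIO` stands in for the config file handle that both threads share.

```python
import io
import threading

# Stand-in for the single config file handle shared by both scrapers.
shared_handle = io.StringIO("api_key=hypothetical-key\n")
handle_lock = threading.Lock()


def read_settings():
    # Hold the lock so that rewinding and reading happen as one step
    # and another thread cannot move the file position in between.
    with handle_lock:
        shared_handle.seek(0)
        return shared_handle.read()


results = []


def worker():
    results.append(read_settings())


threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both threads end up with the full settings text, regardless of which one reads first.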
Upvotes: 0
Reputation: 2847
Get rid of the diaper pattern in your run() method, that is, the catch-all exception handler. You will probably then see the actual error printed. I suspect something is wrong in DBWriter or in other code you call from your doScrape function; perhaps it is not thread-safe. That would explain why running it directly from the main program works, but calling it from a thread fails.
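If you still want the restart behaviour, one option is to catch Exception and log the traceback instead of swallowing it. The following is a rough sketch for Python 3, not a drop-in replacement: `run_with_restarts` and `failing_scrape` are hypothetical names, the standard `logging` module stands in for your own Logger class, and the loop is bounded only so the example terminates.

```python
import logging
import traceback


def run_with_restarts(do_scrape, max_restarts, log):
    """Call do_scrape up to max_restarts times, recording each failure."""
    failures = []
    for attempt in range(max_restarts):
        try:
            do_scrape()
        except Exception:
            # Exception (not a bare except) lets KeyboardInterrupt and
            # SystemExit propagate; the traceback is kept and logged.
            failures.append(traceback.format_exc())
            log.exception("Scraper failed on attempt %d, restarting",
                          attempt + 1)
    return failures


def failing_scrape():
    # Hypothetical scraper that fails the way a bad file read might.
    raise ValueError("read from end of file")


logging.basicConfig(level=logging.DEBUG)
failures = run_with_restarts(failing_scrape, 2, logging.getLogger("scraper"))
```

Each entry in `failures` is the full traceback text, so the underlying error shows up in the log on the first restart rather than being hidden behind a "Restarting scraper" message.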
Upvotes: 1