I have a loop that scrapes through 30 links over and over, and roughly every 10 hours I want to clear another list that keeps certain data it has found. Here's an abstract example of what I'm doing now, without moving the time refresher to a separate thread yet.
from datetime import datetime

CLEAR_FOUND = 10  # clear the list roughly every 10 hours

def check_refresh(time_since_refresh):
    # whole hours elapsed since the last refresh
    time_difference = round((datetime.now() - time_since_refresh).total_seconds() / 60 / 60)
    return time_difference >= CLEAR_FOUND

time_since_refresh = datetime.now()
while True:
    scrape(url)
    if check_refresh(time_since_refresh):
        time_since_refresh = datetime.now()
        temporary_list.clear()
The thing is, scraping speed is important to me, so I'm worried that checking whether it's time to refresh on every loop iteration will slow it down.
Should the refresh timer be a separate thread that sets a boolean flag, which the scraper reads on each loop?
Also, is there a better way to track how much time has elapsed since the start of the loop than my check_refresh "hack"?
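For the thread idea, this is roughly what I'm imagining (just a sketch; refresh_flag and refresh_timer are names I made up):

import threading
import time

refresh_flag = False

def refresh_timer():
    # background thread: raise the flag roughly every CLEAR_FOUND hours
    global refresh_flag
    while True:
        time.sleep(CLEAR_FOUND * 60 * 60)
        refresh_flag = True

threading.Thread(target=refresh_timer, daemon=True).start()

while True:
    scrape(url)
    if refresh_flag:  # just a flag read, no datetime math in the loop
        temporary_list.clear()
        refresh_flag = False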
Upvotes: 0
Views: 21
Reputation: 942
Yes, threading is a better idea. Try something like this, using a threading.Event as the flag the scraper checks:
import threading
import time

refresh_needed = threading.Event()

def refresh_timer():
    # background timer: signal a refresh every CLEAR_FOUND hours
    while True:
        time.sleep(CLEAR_FOUND * 60 * 60)
        refresh_needed.set()

t = threading.Thread(target=refresh_timer, daemon=True)
t.start()

while True:
    scrape(url)
    if refresh_needed.is_set():
        temporary_list.clear()
        refresh_needed.clear()
The reason threading is better is that the timer thread does its waiting in the background while the rest of your code keeps running, so the scraping loop never pauses for the timing logic; all it has to do each iteration is read a flag, which is practically free. Hope this helps!
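As for your second question about measuring elapsed time: if you would rather keep the check inline, time.monotonic() is the standard tool for measuring intervals, and the comparison costs next to nothing compared with a network request. A minimal sketch, reusing your scrape / temporary_list / CLEAR_FOUND names:

import time

REFRESH_SECONDS = CLEAR_FOUND * 60 * 60  # e.g. 10 hours, expressed in seconds
last_refresh = time.monotonic()

while True:
    scrape(url)
    # monotonic() never jumps backwards (unlike datetime.now()), so it is
    # safe for measuring elapsed time
    if time.monotonic() - last_refresh >= REFRESH_SECONDS:
        temporary_list.clear()
        last_refresh = time.monotonic()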
Upvotes: 1