Apython

Reputation: 453

Job scheduling for data scraping in Python

I'm scraping data from a website. I need two values from it: the (grid) frequency and the time.

The data on the website is updated every second. I'd like to continuously save these values (append them) into a list or a tuple using Python. To do that, I tried the schedule library. The following commands run the scraping function (socket_freq) every second:

import schedule
schedule.every(1).seconds.do(socket_freq)

while True:
    schedule.run_pending()

I'm facing two problems:

  1. I don't know how to restrict the schedule to a chosen time interval. For example, I'd like to run it for 5 or 10 minutes. How do I tell the schedule to stop after a certain time?
  2. If I run this code and stop it after a few seconds (using break), I often get duplicate entries. Here is one result, where the first list in the tuple holds the time values and the second list holds the frequency values:

out:

(['19:27:02','19:27:02','19:27:02','19:27:03','19:27:03','19:27:03','19:27:03','19:27:03','19:27:03','19:27:03','19:27:04','19:27:04','19:27:04', ...], 
['50.020','50.020','50.020','50.018','50.018','50.018','50.018','50.018','50.018','50.018','50.017','50.017','50.017'...])

As you can see, each time value is appended multiple times, although I used a schedule that runs every second. What I would actually expect to get is:

out:

(['19:27:02','19:27:03','19:27:04'],['50.020','50.018','50.017'])

Does anybody know how to solve these problems?

Thanks!

(I'm using Python 2.7.9)

Upvotes: 0

Views: 2900

Answers (1)

Etaoin

Reputation: 149

OK, here's how I would tackle these problems:

  1. Record a timestamp at the start of your program, then on each pass through the loop check whether the program has been running long enough and stop if it has.
  2. Use time.sleep() to pause the program between checks, so the loop does not spin continuously.

See my example below:

import schedule
import datetime
import time

# Record the start time so the elapsed time can be checked later
start = datetime.datetime.now()

# Simple callable used as the scheduled job in this example
class DummyClock:
    def __call__(self):
        # Parenthesised form works on both Python 2 and Python 3
        print(datetime.datetime.now())

schedule.every(1).seconds.do(DummyClock())

while True:
    schedule.run_pending()
    # 5 minutes == 300 seconds; total_seconds() also counts whole days
    if (datetime.datetime.now() - start).total_seconds() >= 300:
        break
    # Halt execution for a second before checking again
    time.sleep(1)
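
The repeated timestamps from the question can also be guarded against at the point where a reading is stored, whatever causes them. This is a minimal sketch, assuming each reading arrives as a (time, frequency) pair of strings; append_reading is a hypothetical helper name and is not part of schedule or of the asker's socket_freq:

```python
def append_reading(times, freqs, timestamp, freq):
    """Store a reading unless its timestamp repeats the last stored one.

    Returns True if the reading was stored, False if it was skipped.
    """
    if times and times[-1] == timestamp:
        return False
    times.append(timestamp)
    freqs.append(freq)
    return True


# Sample data copied from the question's output (shortened)
samples = [
    ("19:27:02", "50.020"), ("19:27:02", "50.020"),
    ("19:27:03", "50.018"), ("19:27:03", "50.018"),
    ("19:27:04", "50.017"),
]

times, freqs = [], []
for ts, f in samples:
    append_reading(times, freqs, ts, f)

print((times, freqs))
```

This only compares against the most recent entry, which should be enough when readings arrive in time order.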

All refactoring is welcome
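
One possible refactoring, for a case as simple as this one, is a plain loop with a deadline instead of the schedule library. The following is only a sketch: run_for and its parameters are made-up names, and time.time is used as the default clock because time.monotonic is not available on Python 2.7.

```python
import time


def run_for(job, duration, interval=1.0, clock=time.time, sleep=time.sleep):
    """Call job() roughly every `interval` seconds for `duration` seconds.

    `clock` and `sleep` can be swapped out, which makes the loop easy to
    test without waiting. Returns the number of times job() was called.
    """
    start = clock()
    calls = 0
    while clock() - start < duration:
        job()
        calls += 1
        # Aim for start + calls * interval, so the time spent inside
        # job() does not accumulate as drift from one call to the next.
        delay = start + calls * interval - clock()
        if delay > 0:
            sleep(delay)
    return calls
```

For the question's case this would be called as run_for(socket_freq, duration=300) to collect data for five minutes.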

Upvotes: 2
