foobarbecue
foobarbecue

Reputation: 7060

Gap detection in Django / MySQL

I have timeseries data stored in MySQL InnoDB and I access it using Django's object relational mapper.

My question is: how can I best identify and locate gaps in this timeseries data?

Edit for clarification: although it's fairly easy to get a list of all the missing data points, that doesn't completely solve my problem. I want just the start and end of gaps. The start and end of continuous periods would work just as well.

Edit for further clarification: The mysql columns for the table are below. time is a standard Django DateTimeField. The data in question is sampled once every 15 minutes.

mysql> show columns from datalogging_datapoint;
+----------------------+------------+------+-----+---------+----------------+
| Field                | Type       | Null | Key | Default | Extra          |
+----------------------+------------+------+-----+---------+----------------+
| id                   | int(11)    | NO   | PRI | NULL    | auto_increment |
| new_since_parsing    | tinyint(1) | NO   |     | NULL    |                |
| non_public           | tinyint(1) | NO   |     | NULL    |                |
| time                 | datetime   | NO   |     | NULL    |                |
| value                | double     | NO   |     | NULL    |                |
| parent_timeseries_id | int(11)    | NO   | MUL | NULL    |                |
+----------------------+------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)

Upvotes: 1

Views: 513

Answers (3)

dr jimbob
dr jimbob

Reputation: 17731

You have to give some sort of sample data and how you like it processed. Telling us that you are storing it in MySQL or with innodb is not central to the problem (e.g., the ORM handles that). I assume you are able to extract your time series data as a list of integers, and you are trying to find where gaps start/end from this list.

def gaps(seq):
    seq_set = set(seq) # e.g., set([0, 1, 2, 3, 7, 8, 9, 10, 16, 17, 18])
    full_set = set(range(seq[-1]+1)) # set([0,1,2,3,..., 17, 18])
    missing_pts = list(seq_set ^ full_set) # [4, 5, 6, 11, 12, 13, 14, 15]
    missing_pts.sort() # EDIT: originally didn't have this; 
                       # should have as sets are unordered.
    missing_pt_pairs = []
    first_pt = missing_pts[0]
    prev_pt = missing_pts[0]
    for pt in missing_pts:
        if pt - prev_pt > 1:
            missing_pt_pairs.append((first_pt, prev_pt))
            first_pt = pt
        prev_pt = pt
    missing_pt_pairs.append((first_pt, pt))
    return missing_pt_pairs

time_pts = [0,1,2,3,7,8,9,10,16,17,18]
gaps(time_pts) # returns [(4,6), (11,15)], 
# indicating that two gaps are present starting from [4,6] and [11,15]

Upvotes: 1

foobarbecue
foobarbecue

Reputation: 7060

Thanks for the suggestions guys! I learned something from both of them.

However, I think I just solved my problem in an ideal way by rephrasing the question in my mind. Here's the basic idea:

Count the values in a that year with Django's .count() .
If not complete:
    Count the values for each month in that year
    If not complete:
        Count the values for each day in that month

Upvotes: 0

Spacedman
Spacedman

Reputation: 94202

Look into the numpy and scipy packages for python - you might find some time series analysis functions. Then its just a matter of getting the values out of the database, but thats standard django/python.

Do you want something like this:

def gaps(seq):
  ...
  return gaps_found

which, when fed [1,2,4,5,6,8,12] returns [3,7,9,10,11]? That might be doable with sets.

Upvotes: 0

Related Questions