Alex
Alex

Reputation: 1583

Split value into bins based on time

I'm working in Python modifying New York City subway turnstile data to turn into a visualization of the entrance/exits for each station.

So far I have a list of entrance/exit counts based on start (03-24-15) and end (03-27-15)dates:

{
'endTime': '03-25-14T21:40:30',
'entriesDuringPeriod': 158,
'exitsDuringPeriod': 597,
'startTime': '03-25-14T17:03:23'
},
{
'endTime': '03-26-14T01:00:00',
'entriesDuringPeriod': 29,
'exitsDuringPeriod': 235,
'startTime': '03-25-14T21:00:00'
},

The problem I have is that the different time periods are not standardize and sometimes overlap. I'd like to be able to go through and create another list that normalizes these numbers into one hour increments.

I'm not very familiar with Python time processing, and I was wondering if someone could provide some information about how to get started taking strings, converting them into date objects, and dividing up values based on time.

The final visualization will be visualized using d3.js if that matters.

Upvotes: 1

Views: 2213

Answers (2)

Travis D.
Travis D.

Reputation: 342

Getting the strings into datetime objects isn't too bad:

from datetime import datetime
from time import time, mktime, strptime

def get_datetime( instr ):
  return datetime.fromtimestamp(mktime(strptime(instr, '%m-%d-%yT%H:%M:%S')))

# eg: get_datetime( '03-25-14T21:20:30' ) => datetime.datetime(2014, 3, 25, 21, 20, 30)

Binning / normalizing the data largely depends on how you want to handle the overlapping durations... Eg. Do you want to assume that people arrived & exited in a linear fashion, so that if the timestamps were for an hour and a half, 66% would go into the full hour and 33% into the other partial hour?

EDIT: Based on OP's comment, here's totally functional code:

from datetime import timedelta
from collections import defaultdict

def add_datum( dd, v ):
    end_dt = get_datetime(v['endTime'])
    start_dt = get_datetime(v['startTime'])
    total_duration = end_dt - start_dt 

    hour_start = datetime( year = start_dt.year, 
                           month = start_dt.month, 
                           day = start_dt.day, 
                           hour = start_dt.hour )
    hour_end = hour_start + timedelta( hours = 1 )

    while hour_start < end_dt:
        dt = min([hour_end, end_dt]) - max([ hour_start, start_dt ])
        fraction = 1.0 * dt.total_seconds() / total_duration.total_seconds()
        dd[ hour_start ]['hour'] = hour_start
        dd[ hour_start ]['entries'] += v['entriesDuringPeriod'] * fraction
        dd[ hour_start ]['exits'] += v['exitsDuringPeriod'] * fraction # exits

        hour_start = hour_end
        hour_end = hour_end + timedelta( hours = 1 )
    return dd


dd = defaultdict(lambda: {'entries':0,'exits':0})
all_data = [{ 'endTime': '03-25-14T21:40:30',
              'entriesDuringPeriod': 158,
              'exitsDuringPeriod': 597,
              'startTime': '03-25-14T17:03:23' },
            { 'endTime': '03-26-14T01:00:00',
              'entriesDuringPeriod': 29,
              'exitsDuringPeriod': 235,
              'startTime': '03-25-14T21:00:00' }]

[ add_datum( dd, i ) for i in all_data ]
res = dd.values()
res.sort( key = lambda i: i['hour'] )

print res
# [{'entries': 32.28038732182594,
#   'exits': 121.97083057677271,
#   'hour': datetime.datetime(2014, 3, 25, 17, 0)},
#  {'entries': 34.209418415829674,
#   'exits': 129.25963793829314,
#   'hour': datetime.datetime(2014, 3, 25, 18, 0)},
#  {'entries': 34.209418415829674,
#   'exits': 129.25963793829314,
#   'hour': datetime.datetime(2014, 3, 25, 19, 0)},
#  {'entries': 34.209418415829674,
#   'exits': 129.25963793829314,
#   'hour': datetime.datetime(2014, 3, 25, 20, 0)},
#  {'entries': 30.34135743068503,
#   'exits': 146.00025560834786,
#   'hour': datetime.datetime(2014, 3, 25, 21, 0)},
#  {'entries': 7.25,
#   'exits': 58.75,
#   'hour': datetime.datetime(2014, 3, 25, 22, 0)},
#  {'entries': 7.25,
#   'exits': 58.75,
#   'hour': datetime.datetime(2014, 3, 25, 23, 0)},
#  {'entries': 7.25,
#   'exits': 58.75,
#   'hour': datetime.datetime(2014, 3, 26, 0, 0)}]

Upvotes: 1

Alex Laties
Alex Laties

Reputation: 70

The way to parse the timestamps as given would be:

import datetime
datetime.datetime.strptime("03-25-14T21:08:12", "%m-%d-%yT%H:%M:%S")

That will provide you a datetime object, which is comparable with other datetime objects. The datetime library itself contains all the functions to create arbitrary datetime objects that can then be used for sorting/binning purposes.

Given that the time periods in the data are non-uniform and contain overlap, there are a few approaches possible. If you're alright with linearly averaging entries and exits, you can take each time period and calculate how many entries and exits occur per hour on average, then, given an hour, you could iterate through all data points, find how much a data point overlaps with that hour (i.e. 15 minutes or the whole hour), and apply the data point's average entries/exits per hour modified by the percentage of overlap to an accumulator.

In terms of pseudo-code:

assume we have an hour to data accumulation bag called htdab
for given hour in time series
    for data point in all data points
        if data point's start <= given hour <= data point's end
            calculate percentage of overlap on top of given hour
            add data point's entries per hour * percentage of overlap to our htdab's entry for given hour
            add data point's exits per hour * percentage of overlap to our htdab's entry for given hour

Upvotes: 1

Related Questions