camiel1985
camiel1985

Reputation: 53

makings a bins column with dates in python

In python i have data that looks like this with 500.000 rows :

TIME                            count

1-1-1900 10:41:00        1

3-1-1900 09:54:00        1

4-1-1900 15:45:00        1

5-1-1900 18:41:00        1

4-1-1900 15:45:00        1

and i want to make a new column with bins in quarters like this:

bins                count

9:00-9:15        2

9:15-9:30        4

9:30-9:45        4

10:00-10:15     4

i know how you make bins, but the timestamp gives me troubles. Can somebody help me with this? already thank you!

Upvotes: 4

Views: 4734

Answers (3)

K14
K14

Reputation: 191

I know it's late. But better late than never. I also came across a similar requirement and done by using pandas library.

  • First, Load data in pandas data-frame

  • Second, check TIME column must be datetime object and not object type (like string or whatever). You can check it by

    df.info()

for example, in my case TIME column was initially of object type i.e. string type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME           17640 non-null object
value          17640 non-null int64
dtypes: int64(1), object(2)
memory usage: 413.5+ KB
  • if that is the case, then convert it to pandas datetime object by using this command

    df['TIME'] = pd.to_datetime(df['TIME']) ignore this if already in datetime format

df.info() now gives updated format

 <class 'pandas.core.frame.DataFrame'>
 RangeIndex: 17640 entries, 0 to 17639
 Data columns (total 3 columns):
 TIME           17640 non-null datetime64[ns]
 value          17640 non-null int64
 dtypes: datetime64[ns](2), int64(1)
 memory usage: 413.5 KB
  • Now our dataframe is ready for magic :)

       counts = pd.Series(index=df.TIME, data=np.array(df.count)).resample('15T').count()
       print(counts[:3])
    
     TIME
     2017-07-01 00:00:00    3
     2017-07-01 00:15:00    3
     2017-07-01 00:30:00    3
     Freq: 15T, dtype: int64
    

    in above command 15T means 15minutes bucket, you can replace it with D for day bucket, 2D for 2 days bucket, M for month bucket, 2M for 2 months bucket and so on. You can read the detail of these notations on this link

  • now, our buckets data is done as you can see above. for time range use this command. Use the same time range as of data. In my case, my data was 3 months so I am creating time-range of 3 months.

 r = pd.date_range('2017-07', '2017-09', freq='15T')
 x = np.repeat(np.array(r), 2, axis=0)[1:-1]
 # now reshape data to fit in Dataframe
 x = np.array(x)[:].reshape(-1, 2)
 # now fit in dataframe and print it
 final_df = pd.DataFrame(x, columns=['start', 'end'])
 print(final_df[:3])
                  start                 end
0   2017-07-01 00:00:00 2017-07-01 00:15:00
1   2017-07-01 00:15:00 2017-07-01 00:30:00
2   2017-07-01 00:30:00 2017-07-01 00:45:00

date ranges also done

  • Now append count and dateranges to get final outcome

     final_df['count'] = np.array(means)
     print(final_df[:3])
    
                  start                 end count
0   2017-07-01 00:00:00 2017-07-01 00:15:00     3
1   2017-07-01 00:15:00 2017-07-01 00:30:00     3
2   2017-07-01 00:30:00 2017-07-01 00:45:00     3

Hope anyone find it useful.

Upvotes: 3

yucer
yucer

Reputation: 5049

Just trying without pandas:

from collections import defaultdict
import datetime as dt
from itertools import groupby

def bin_ts(dtime, delta):
    modulo = dtime.timestamp() % delta.total_seconds()
    return dtime - dt.timedelta(seconds=modulo)

src_data = [
    ('1-1-1900 10:41:00', 1),
    ('3-1-1900 09:54:00', 1),
    ('4-1-1900 15:45:00', 1),
    ('5-1-1900 18:41:00', 1),
    ('4-1-1900 15:45:00', 1)
]

ts_data = [(dt.datetime.strptime(ts, '%d-%m-%Y %H:%M:%S'), count) for ts, count in src_data]

bin_size = dt.timedelta(minutes=15)

binned = [(bin_ts(ts, bin_size), count) for ts, count in ts_data]

def time_fmt(ts):
    res = "%s - %s" % (ts.strftime('%H:%M'), (ts + bin_size).strftime('%H:%M'))
    return res

binned_time = [(time_fmt(ts), count) for ts, count in binned]

cnts = defaultdict(int)
for ts, group in groupby(binned_time, lambda x: x[0]):
    for row in group:
        cnts[ts] += row[1]

output = list(cnts.items())

output.sort(key=lambda x: x[0])

from pprint import pprint
pprint(output)

result in:

[('09:45 - 10:00', 1),
 ('10:30 - 10:45', 1),
 ('15:45 - 16:00', 2),
 ('18:30 - 18:45', 1)]

Upvotes: 0

Eli Korvigo
Eli Korvigo

Reputation: 10513

Well, I'm not sure that this is what you asked for. If it's not, I would recommend you to improve your question, because it's very hard to understand your problem. In particular, it would be nice to see what you've already tried to do.

from __future__ import division, print_function
from collections import namedtuple
from itertools import product
from datetime import time
from StringIO import StringIO


MAX_HOURS = 23
MAX_MINUTES = 59


def process_data_file(data_file):
    """
    The data_file is supposed to be an opened file object
    """
    time_entry = namedtuple("time_entry", ["time", "count"])
    data_to_bin = []
    for line in data_file:
        t, count = line.rstrip().split("\t")
        t = map(int, t.split()[-1].split(":")[:2])
        data_to_bin.append(time_entry(time(*t), int(count)))
    return data_to_bin


def make_milestones(min_hour=0, max_hour=MAX_HOURS, interval=15):
    minutes = [minutes for minutes in xrange(MAX_MINUTES+1) if not minutes % interval]
    hours = range(min_hour, max_hour+1)
    return [time(*milestone) for milestone in list(product(hours, minutes))]


def bin_time(data_to_bin, milestones):
    time_entry = namedtuple("time_entry", ["time", "count"])
    data_to_bin = sorted(data_to_bin, key=lambda time_entry: time_entry.time, reverse=True)
    binned_data = []
    current_count = 0
    upper = milestones.pop()
    lower = milestones.pop()
    for entry in data_to_bin:
        while not lower <= entry.time <= upper:
            if current_count:
                binned_data.append(time_entry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
                current_count = 0
            upper, lower = lower, milestones.pop()
        current_count += entry.count
    return binned_data


data_file = StringIO("""1-1-1900 10:41:00\t1
3-1-1900 09:54:00\t1
4-1-1900 15:45:00\t1
5-1-1900 18:41:00\t1
4-1-1900 15:45:00\t1""")


binned_time = bin_time(process_data_file(data_file), make_milestones())
for entry in binned_time:
    print(entry.time, entry.count, sep="\t")

The output:

18:30-18:45 1
15:45-16:00 2
10:30-10:45 1

Upvotes: 1

Related Questions