Reputation: 53
In python i have data that looks like this with 500.000 rows :
TIME count
1-1-1900 10:41:00 1
3-1-1900 09:54:00 1
4-1-1900 15:45:00 1
5-1-1900 18:41:00 1
4-1-1900 15:45:00 1
and i want to make a new column with bins in quarters like this:
bins count
9:00-9:15 2
9:15-9:30 4
9:30-9:45 4
10:00-10:15 4
i know how you make bins, but the timestamp gives me troubles. Can somebody help me with this? already thank you!
Upvotes: 4
Views: 4734
Reputation: 191
I know it's late. But better late than never. I also came across a similar requirement and done by using pandas library.
First, Load data in pandas data-frame
Second, check TIME column must be datetime object and not object type (like string or whatever). You can check it by
df.info()
for example, in my case TIME column was initially of object type i.e. string type
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME 17640 non-null object
value 17640 non-null int64
dtypes: int64(1), object(2)
memory usage: 413.5+ KB
if that is the case, then convert it to pandas datetime object by using this command
df['TIME'] = pd.to_datetime(df['TIME'])
ignore this if already in datetime format
df.info()
now gives updated format
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17640 entries, 0 to 17639
Data columns (total 3 columns):
TIME 17640 non-null datetime64[ns]
value 17640 non-null int64
dtypes: datetime64[ns](2), int64(1)
memory usage: 413.5 KB
Now our dataframe is ready for magic :)
counts = pd.Series(index=df.TIME, data=np.array(df.count)).resample('15T').count()
print(counts[:3])
TIME
2017-07-01 00:00:00 3
2017-07-01 00:15:00 3
2017-07-01 00:30:00 3
Freq: 15T, dtype: int64
in above command 15T
means 15minutes bucket, you can replace it with D
for day bucket, 2D
for 2 days bucket, M
for month bucket, 2M
for 2 months bucket and so on. You can read the detail of these notations on this link
now, our buckets data is done as you can see above. for time range use this command. Use the same time range as of data. In my case, my data was 3 months so I am creating time-range of 3 months.
r = pd.date_range('2017-07', '2017-09', freq='15T')
x = np.repeat(np.array(r), 2, axis=0)[1:-1]
# now reshape data to fit in Dataframe
x = np.array(x)[:].reshape(-1, 2)
# now fit in dataframe and print it
final_df = pd.DataFrame(x, columns=['start', 'end'])
print(final_df[:3])
start end
0 2017-07-01 00:00:00 2017-07-01 00:15:00
1 2017-07-01 00:15:00 2017-07-01 00:30:00
2 2017-07-01 00:30:00 2017-07-01 00:45:00
date ranges also done
Now append count and dateranges to get final outcome
final_df['count'] = np.array(means)
print(final_df[:3])
start end count
0 2017-07-01 00:00:00 2017-07-01 00:15:00 3
1 2017-07-01 00:15:00 2017-07-01 00:30:00 3
2 2017-07-01 00:30:00 2017-07-01 00:45:00 3
Hope anyone find it useful.
Upvotes: 3
Reputation: 5049
Just trying without pandas:
from collections import defaultdict
import datetime as dt
from itertools import groupby
def bin_ts(dtime, delta):
modulo = dtime.timestamp() % delta.total_seconds()
return dtime - dt.timedelta(seconds=modulo)
src_data = [
('1-1-1900 10:41:00', 1),
('3-1-1900 09:54:00', 1),
('4-1-1900 15:45:00', 1),
('5-1-1900 18:41:00', 1),
('4-1-1900 15:45:00', 1)
]
ts_data = [(dt.datetime.strptime(ts, '%d-%m-%Y %H:%M:%S'), count) for ts, count in src_data]
bin_size = dt.timedelta(minutes=15)
binned = [(bin_ts(ts, bin_size), count) for ts, count in ts_data]
def time_fmt(ts):
res = "%s - %s" % (ts.strftime('%H:%M'), (ts + bin_size).strftime('%H:%M'))
return res
binned_time = [(time_fmt(ts), count) for ts, count in binned]
cnts = defaultdict(int)
for ts, group in groupby(binned_time, lambda x: x[0]):
for row in group:
cnts[ts] += row[1]
output = list(cnts.items())
output.sort(key=lambda x: x[0])
from pprint import pprint
pprint(output)
result in:
[('09:45 - 10:00', 1),
('10:30 - 10:45', 1),
('15:45 - 16:00', 2),
('18:30 - 18:45', 1)]
Upvotes: 0
Reputation: 10513
Well, I'm not sure that this is what you asked for. If it's not, I would recommend you to improve your question, because it's very hard to understand your problem. In particular, it would be nice to see what you've already tried to do.
from __future__ import division, print_function
from collections import namedtuple
from itertools import product
from datetime import time
from StringIO import StringIO
MAX_HOURS = 23
MAX_MINUTES = 59
def process_data_file(data_file):
"""
The data_file is supposed to be an opened file object
"""
time_entry = namedtuple("time_entry", ["time", "count"])
data_to_bin = []
for line in data_file:
t, count = line.rstrip().split("\t")
t = map(int, t.split()[-1].split(":")[:2])
data_to_bin.append(time_entry(time(*t), int(count)))
return data_to_bin
def make_milestones(min_hour=0, max_hour=MAX_HOURS, interval=15):
minutes = [minutes for minutes in xrange(MAX_MINUTES+1) if not minutes % interval]
hours = range(min_hour, max_hour+1)
return [time(*milestone) for milestone in list(product(hours, minutes))]
def bin_time(data_to_bin, milestones):
time_entry = namedtuple("time_entry", ["time", "count"])
data_to_bin = sorted(data_to_bin, key=lambda time_entry: time_entry.time, reverse=True)
binned_data = []
current_count = 0
upper = milestones.pop()
lower = milestones.pop()
for entry in data_to_bin:
while not lower <= entry.time <= upper:
if current_count:
binned_data.append(time_entry("{}-{}".format(str(lower)[:-3], str(upper)[:-3]), current_count))
current_count = 0
upper, lower = lower, milestones.pop()
current_count += entry.count
return binned_data
data_file = StringIO("""1-1-1900 10:41:00\t1
3-1-1900 09:54:00\t1
4-1-1900 15:45:00\t1
5-1-1900 18:41:00\t1
4-1-1900 15:45:00\t1""")
binned_time = bin_time(process_data_file(data_file), make_milestones())
for entry in binned_time:
print(entry.time, entry.count, sep="\t")
The output:
18:30-18:45 1
15:45-16:00 2
10:30-10:45 1
Upvotes: 1