Reputation: 905
I have a log file that's just date in one column and time in the other. I'm trying to find the most popular time periods.
Date Time
Jan/01/2017 08:23:45
Jan/01/2017 15:54:21
Jan/02/2017 04:02:39
Jan/03/2017 06:33:12
...
I'm looking for an efficient way to group the data into 10-minute portions and then find the most popular one-hour intervals. It's very likely that the most popular hour-long intervals would be consecutive, like:
Interval Count
08:10:00 - 09:10:00 586
08:20:00 - 09:20:00 565
08:30:00 - 09:30:00 544
...
This has to scale up well to GBs of data, and I need to be able to find the most popular intervals, preferably without sorting the entire table.
Upvotes: 1
Views: 574
Reputation: 54233
You could convert to minutes since midnight, use integer division, and a Counter. No need to sort the data; this should work fine and be efficient:
from collections import Counter

log = """Jan/01/2017 08:23:45
Jan/01/2017 15:54:21
Jan/01/2017 15:50:21
Jan/01/2017 15:52:21
Jan/02/2017 04:02:39
Jan/03/2017 06:33:12"""

portion = 10   # size of a portion, in minutes
interval = 60  # size of an interval, in minutes

counter = Counter()
for line in log.split("\n"):
    time = line.split()[-1]
    hour, minute, second = map(int, time.split(':'))
    since_midnight = hour * 60 + minute
    # count events per 10-minute portion of the day
    counter[since_midnight // portion] += 1

for slot, count in counter.most_common():
    print("%02d:%02d -> %02d:%02d - %d" % ((slot * portion) // 60,
                                           (slot * portion) % 60,
                                           ((slot + 1) * portion) // 60,
                                           ((slot + 1) * portion) % 60,
                                           count))
It outputs:
15:50 -> 16:00 - 3
04:00 -> 04:10 - 1
08:20 -> 08:30 - 1
06:30 -> 06:40 - 1
Since you didn't write any code, I'll leave an exercise for you: for a given 10-minute portion, increment the counter of every 60-minute interval containing that portion. A simple for loop should do.
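In case it helps, here is one possible shape of that loop (a sketch reusing the counter, portion and interval variables from the snippet above; letting intervals wrap around midnight via the modulo is my own assumption):

slots_per_day = 24 * 60 // portion  # 144 ten-minute portions per day
span = interval // portion          # 6 portions per one-hour interval

interval_counter = Counter()
for slot, count in counter.items():
    # a portion at `slot` is covered by every interval starting at
    # slot - 5 .. slot; the modulo lets intervals wrap past midnight
    for start in range(slot - span + 1, slot + 1):
        interval_counter[start % slots_per_day] += count

for start, count in interval_counter.most_common(3):
    begin = start * portion
    end = (begin + interval) % (24 * 60)
    print("%02d:%02d - %02d:%02d %d" % (begin // 60, begin % 60,
                                        end // 60, end % 60, count))

Note that most_common(3) heap-selects only the top results rather than sorting every interval, which lines up with your "no full sort" requirement.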
Also, you should read the file line by line. The split("\n") above was just for a simple example.
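A minimal sketch of that, assuming the log lives in a plain-text file (the access.log name and the header-row skip are placeholders for whatever your file actually looks like):

from collections import Counter

portion = 10  # minutes per portion, as above

counter = Counter()
with open("access.log") as f:  # "access.log" is a hypothetical filename
    for line in f:
        line = line.strip()
        if not line or line.startswith("Date"):  # skip blanks and the header row
            continue
        time = line.split()[-1]
        hour, minute, second = map(int, time.split(':'))
        counter[(hour * 60 + minute) // portion] += 1

This streams the file instead of loading it into memory, so it should hold up at GB scale.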
Upvotes: 1