Reputation: 919
My data has various columns including a date and a time column. The data is stretched across three months. I need to count no. of rows in a particular hour irrespective of the date. So that would mean getting the count of rows in 00:00 to 01:00 window and similarly for the rest 23 hours. How do I do that? Overall I will have 24 rows with their counts. Here is my data:
>>>df[["date","time"]]
date time
0 2006-11-10 00:01:21
1 2006-11-10 00:02:26
2 2006-11-10 00:02:38
3 2006-11-10 00:05:38
4 2006-11-10 00:05:38
Output should be like:
00:00-00:59 SomeCount
Both are object types
Upvotes: 2
Views: 1333
Reputation: 31011
My solution generates row counts for all 24 hours, with 0 for hours "absent" in the source DataFrame.
To show a more instructive example, I defined the source DataFrame containing rows from several hours:
date time
0 2006-11-10 01:21:00
1 2006-11-10 02:26:00
2 2006-11-10 02:38:00
3 2006-11-10 05:38:00
4 2006-11-10 05:38:00
5 2006-11-11 05:43:00
6 2006-11-11 05:51:00
Note that last 2 rows are from different date, but as you want grouping by hour only, they will be counted in the same group as previous 2 rows (hour 5).
The first step is to create a Series containing almost what you want:
wrk = df.groupby(pd.to_datetime(df.time).dt.hour).apply(
lambda grp: grp.index.size).reindex(range(24), fill_value=0)
The initial part of wrk is:
time
0 0
1 1
2 2
3 0
4 0
5 4
6 0
7 0
The left column (the index) contains hour as an integer and the right column is the count - how many rows are in this hour.
The only thing to do is to reformat the index to your desired format:
wrk.index = wrk.index.map(lambda h: f'{h:02}:00-{h:02}:59')
The result (initial part only) is:
time
00:00-00:59 0
01:00-01:59 1
02:00-02:59 2
03:00-03:59 0
04:00-04:59 0
05:00-05:59 4
06:00-06:59 0
07:00-07:59 0
But if you want to get counts only for hours present in your source
data, then drop .reindex(…)
from the above code.
Then your (full) result, for the above DataFrame will be:
time
01:00-01:59 1
02:00-02:59 2
05:00-05:59 4
dtype: int64
Upvotes: 1
Reputation: 863741
I think simpliest is convert both columns to datetimes and count hours by Series.dt.hour
with Series.value_counts
:
out = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.hour.value_counts().sort_index()
Or if need your format use Series.dt.strftime
with GroupBy.size
:
s = pd.to_datetime(df["date"] + ' ' + df["time"]).dt.strftime('%H:00-%H:59')
print (s)
0 00:00-00:59
1 00:00-00:59
2 00:00-00:59
3 00:00-00:59
4 00:00-00:59
dtype: object
out = s.groupby(s, sort=False).size()
print (out)
00:00-00:59 5
dtype: int64
Last for DataFrame use:
df = out.rename_axis('times').reset_index(name='count')
Upvotes: 1
Reputation: 31236
groupby()
the hourimport io
df = pd.read_csv(io.StringIO(""" date time
0 2006-11-10 00:01:21
1 2006-11-10 00:02:26
2 2006-11-10 00:02:38
3 2006-11-10 00:05:38
4 2006-11-10 02:05:38"""), sep="\s\s+", engine="python")
dfc = (df.groupby(pd.to_datetime(df.time).dt.hour)
.apply(lambda d: pd.DataFrame({"count":[len(d)]},
index=[pd.to_datetime(d["time"]).min().strftime("%H:%M")
+"-"+pd.to_datetime(d["time"]).max().strftime("%H:%M")]))
.reset_index()
.drop(columns=["time"])
.rename(columns={"level_1":"time"})
)
time | count | |
---|---|---|
0 | 00:01-00:05 | 4 |
1 | 02:05-02:05 | 1 |
Upvotes: 1
Reputation: 189
You can split the time string with the delimiter :
. Then create an another column hour
for hour. Then use groupby()
to group them on the basis of new column hour
. You can now store the data in a new series or dataframe to get the desired output
Upvotes: 1