Reputation: 1702
I have a dataframe with a column of datetime type that I'd like to create a 2nd categorical column based upon. The second column should consist of string identifiers using the following logic:
ON: [16:00-09:30)
FH: [09:30-10:00)
M: [10:00-15:00)
SLH: [15:00-15:30)
LH: [15:30-16:00)
For instance, for dataframe d
:
date_rng = pd.date_range(start='2024-01-01', periods=48, freq='30min')
d = pd.DataFrame({'datetime': date_rng})
I'd like the output below:
d.head(40)
Out[186]:
datetime part
0 2024-01-01 00:00:00 ON
1 2024-01-01 00:30:00 ON
2 2024-01-01 01:00:00 ON
3 2024-01-01 01:30:00 ON
4 2024-01-01 02:00:00 ON
5 2024-01-01 02:30:00 ON
6 2024-01-01 03:00:00 ON
7 2024-01-01 03:30:00 ON
8 2024-01-01 04:00:00 ON
9 2024-01-01 04:30:00 ON
10 2024-01-01 05:00:00 ON
11 2024-01-01 05:30:00 ON
12 2024-01-01 06:00:00 ON
13 2024-01-01 06:30:00 ON
14 2024-01-01 07:00:00 ON
15 2024-01-01 07:30:00 ON
16 2024-01-01 08:00:00 ON
17 2024-01-01 08:30:00 ON
18 2024-01-01 09:00:00 FH
19 2024-01-01 09:30:00 FH
20 2024-01-01 10:00:00 M
21 2024-01-01 10:30:00 M
22 2024-01-01 11:00:00 M
23 2024-01-01 11:30:00 M
24 2024-01-01 12:00:00 M
25 2024-01-01 12:30:00 M
26 2024-01-01 13:00:00 M
27 2024-01-01 13:30:00 M
28 2024-01-01 14:00:00 M
29 2024-01-01 14:30:00 M
30 2024-01-01 15:00:00 SLH
31 2024-01-01 15:30:00 LH
32 2024-01-01 16:00:00 ON
33 2024-01-01 16:30:00 ON
34 2024-01-01 17:00:00 ON
35 2024-01-01 17:30:00 ON
36 2024-01-01 18:00:00 ON
37 2024-01-01 18:30:00 ON
38 2024-01-01 19:00:00 ON
39 2024-01-01 19:30:00 ON
Upvotes: 0
Views: 53
Reputation: 14369
Option 1: pd.cut
base_time = d['datetime'].dt.normalize() + pd.Timedelta(hours=16)
offset = (d['datetime'] - base_time).dt.total_seconds().mod(86400) # 24 * 3600
bins = [hours * 3600 for hours in [0, 17.5, 18, 23, 23.5, 24]]
labels = ['ON', 'FH', 'M', 'SLH', 'LH']
d['part'] = pd.cut(offset, bins=bins, labels=labels, right=False)
Output:
# showing first row per continuous value
d[d['part'].ne(d['part'].shift())]
datetime part
0 2024-01-01 00:00:00 ON
19 2024-01-01 09:30:00 FH
20 2024-01-01 10:00:00 M
30 2024-01-01 15:00:00 SLH
31 2024-01-01 15:30:00 LH
32 2024-01-01 16:00:00 ON
with:
d['part'].dtype
CategoricalDtype(categories=['ON', 'FH', 'M', 'SLH', 'LH'], ordered=True,
categories_dtype=object)
Explanation
Series.dt.normalize
+ offset 16 hours with pd.Timedelta
to set a base_time
at 16:00.d['datetime']
and get Series.dt.total_seconds
+ Series.mod
with 24 * 3600
(day in seconds).bins
with endpoints set against 16:00 as 0. E.g., 17.5 = 8 hours until end day + 9.5 hours next day (09:30).labels
and apply pd.cut
with right=False
.Option 2: np.select
import numpy as np
delta = (d['datetime'] - d['datetime'].dt.normalize()).dt.total_seconds()
condlist = [
(delta < hours * 3600) for hours in [9.5, 10, 15, 15.5, 16]
]
d['part2'] = np.select(condlist=condlist, choicelist=labels,
default='ON')
d['part2'] = pd.Categorical(d['part2'], categories=labels, ordered=True)
Equality check:
d['part'].equals(d['part2'])
# True
Explanation
Series.dt.normalize
+ df.total_seconds
.condlist
("bins") with delta
smaller than 09:30 (9.5), 10, etc.condlist
to np.select
and use default='ON'
to get >= 16:00 as 'ON' as well.pd.Categorical
with categories=labels
and ordered=True
to get: ['ON' < 'FH' < 'M' < 'SLH' < 'LH']
.Upvotes: 3
Reputation: 142985
You can create function which gets one value and convert it to one of strings ON
,FH
, etc.
and later you could use
df['part'] = df['datetime'].apply(your_function)
Minimal working code.
I get time converted to string
and compare with string
but you can also compare directly objects datetime.time
.
You may also rewrite it without some >=
import pandas as pd
date_rng = pd.date_range(start='2024-01-01', periods=48, freq='30min')
df = pd.DataFrame({'datetime': date_rng})
def my_function(item):
text = str(item.time())
#print(text)
if text >= '16:00:00' or text < '09:30:00': # here is `or` instead of `and` because value has to be in one of ranges [16:00-23:59] or [00:00-09:30)
return "ON"
elif text >= '09:30:00' and text < '10:00:00':
return "FH"
elif text >= '10:00:00' and text < '15:00:00':
return "M"
elif text >= '15:00:00' and text < '15:30:00':
return "SLH"
elif text >= '15:30:00' and text < '16:00:00':
return "LH"
return "???"
df['part'] = df['datetime'].apply(my_function)
print(df)
Upvotes: 1