Chris
Chris

Reputation: 1702

Create a categorical column from a datetime column in Python

I have a dataframe with a column of datetime type that I'd like to create a 2nd categorical column based upon. The second column should consist of string identifiers using the following logic:

ON: [16:00-09:30)
FH: [09:30-10:00)
M: [10:00-15:00)
SLH: [15:00-15:30)
LH: [15:30-16:00) 

For instance, for dataframe d:

date_rng = pd.date_range(start='2024-01-01', periods=48, freq='30min')  
d = pd.DataFrame({'datetime': date_rng})
    

I'd like the output below:

d.head(40)
Out[186]: 
              datetime part
0  2024-01-01 00:00:00   ON
1  2024-01-01 00:30:00   ON
2  2024-01-01 01:00:00   ON
3  2024-01-01 01:30:00   ON
4  2024-01-01 02:00:00   ON
5  2024-01-01 02:30:00   ON
6  2024-01-01 03:00:00   ON
7  2024-01-01 03:30:00   ON
8  2024-01-01 04:00:00   ON
9  2024-01-01 04:30:00   ON
10 2024-01-01 05:00:00   ON
11 2024-01-01 05:30:00   ON
12 2024-01-01 06:00:00   ON
13 2024-01-01 06:30:00   ON
14 2024-01-01 07:00:00   ON
15 2024-01-01 07:30:00   ON
16 2024-01-01 08:00:00   ON
17 2024-01-01 08:30:00   ON
18 2024-01-01 09:00:00   FH
19 2024-01-01 09:30:00   FH
20 2024-01-01 10:00:00    M
21 2024-01-01 10:30:00    M
22 2024-01-01 11:00:00    M
23 2024-01-01 11:30:00    M
24 2024-01-01 12:00:00    M
25 2024-01-01 12:30:00    M
26 2024-01-01 13:00:00    M
27 2024-01-01 13:30:00    M
28 2024-01-01 14:00:00    M
29 2024-01-01 14:30:00    M
30 2024-01-01 15:00:00  SLH
31 2024-01-01 15:30:00   LH
32 2024-01-01 16:00:00   ON
33 2024-01-01 16:30:00   ON
34 2024-01-01 17:00:00   ON
35 2024-01-01 17:30:00   ON
36 2024-01-01 18:00:00   ON
37 2024-01-01 18:30:00   ON
38 2024-01-01 19:00:00   ON
39 2024-01-01 19:30:00   ON

Upvotes: 0

Views: 53

Answers (2)

ouroboros1
ouroboros1

Reputation: 14369

Option 1: pd.cut

base_time = d['datetime'].dt.normalize() + pd.Timedelta(hours=16)

offset = (d['datetime'] - base_time).dt.total_seconds().mod(86400) # 24 * 3600

bins = [hours * 3600 for hours in [0, 17.5, 18, 23, 23.5, 24]]

labels = ['ON', 'FH', 'M', 'SLH', 'LH']

d['part'] = pd.cut(offset, bins=bins, labels=labels, right=False)

Output:

# showing first row per continuous value
d[d['part'].ne(d['part'].shift())]

              datetime part
0  2024-01-01 00:00:00   ON
19 2024-01-01 09:30:00   FH
20 2024-01-01 10:00:00    M
30 2024-01-01 15:00:00  SLH
31 2024-01-01 15:30:00   LH
32 2024-01-01 16:00:00   ON

with:

d['part'].dtype

CategoricalDtype(categories=['ON', 'FH', 'M', 'SLH', 'LH'], ordered=True, 
                 categories_dtype=object)

Explanation

  • Use Series.dt.normalize + offset 16 hours with pd.Timedelta to set a base_time at 16:00.
  • Subtract from d['datetime'] and get Series.dt.total_seconds + Series.mod with 24 * 3600 (day in seconds).
  • Create bins with endpoints set against 16:00 as 0. E.g., 17.5 = 8 hours until end day + 9.5 hours next day (09:30).
  • Set labels and apply pd.cut with right=False.

Option 2: np.select

import numpy as np

delta = (d['datetime'] - d['datetime'].dt.normalize()).dt.total_seconds()

condlist = [
    (delta < hours * 3600) for hours in [9.5, 10, 15, 15.5, 16]
    ]

d['part2'] = np.select(condlist=condlist, choicelist=labels, 
                       default='ON')

d['part2'] = pd.Categorical(d['part2'], categories=labels, ordered=True)

Equality check:

d['part'].equals(d['part2'])
# True

Explanation

  • Get timedelta in seconds with Series.dt.normalize + df.total_seconds.
  • Set up condlist ("bins") with delta smaller than 09:30 (9.5), 10, etc.
  • Pass condlist to np.select and use default='ON' to get >= 16:00 as 'ON' as well.
  • Finally, apply pd.Categorical with categories=labels and ordered=True to get: ['ON' < 'FH' < 'M' < 'SLH' < 'LH'].

Upvotes: 3

furas
furas

Reputation: 142985

You can create function which gets one value and convert it to one of strings ON,FH, etc.
and later you could use

df['part'] = df['datetime'].apply(your_function)

Minimal working code.

I get time converted to string and compare with string
but you can also compare directly objects datetime.time.

You may also rewrite it without some >=

import pandas as pd

date_rng = pd.date_range(start='2024-01-01', periods=48, freq='30min')  
df = pd.DataFrame({'datetime': date_rng})

def my_function(item):
    text = str(item.time()) 
    #print(text)
    if text >= '16:00:00' or text < '09:30:00':  # here is `or` instead of `and` because value has to be in one of ranges [16:00-23:59] or [00:00-09:30)
        return "ON"
    elif text >= '09:30:00' and text < '10:00:00':
        return "FH"
    elif text >= '10:00:00' and text < '15:00:00':
        return "M"
    elif text >= '15:00:00' and text < '15:30:00':
        return "SLH"
    elif text >= '15:30:00' and text < '16:00:00':
        return "LH"
    return "???"

df['part'] = df['datetime'].apply(my_function)

print(df)

Upvotes: 1

Related Questions