eddiewastaken
eddiewastaken

Reputation: 832

How do I represent time as an int in a pandas DataFrame?

I have a DataFrame, read in from a CSV, as such (times as HH:MM:SS):

    pta         ptd         tpl_num
4   05:17       05:18       0
6   05:29:30    05:30       1
9   05:42       05:44:30    2
11  05:53       05:54       3
12  06:03       06:05:30    4
17  06:24:30    NaN         5

dtypes:

pta                object
ptd                object
tpl_num             int64

I'm trying to get the pta and ptd columns as an int, in the format:

    pta      ptd      tpl_num
4   51700    51800    0
6   52930    53000    1
9   54200    54430    2
11  55300    55400    3
12  60300    60530    4
17  62430    NaN      5

Seems quite simple - pad trailing zeroes, and convert to an int. But I can't find a way to do it. I've converted to string, padded the zeroes, stripped the colons, but that won't allow me to convert to int due to the NaN cells not being recognised. I can't convert to datetime without having the :SS in all cells. Not sure what to do.

Upvotes: 0

Views: 244

Answers (3)

AirSquid
AirSquid

Reputation: 11883

Here is an adjustment that converts to decimalized minutes, which should be appropriate for training:

import pandas as pd
import numpy as np
from datetime import datetime

col = pd.Series(['5:17', '05:30', np.NaN, '12:30:10'])

def to_minutes(t):
    if pd.isna(t):
        return t
    if t.count(':')==2:
        t = [int(s) for s in t.split(':')]
        return 60*t[0] + t[1] + 1.0/60*t[2]
    if t.count(':')==1:
        t = [int(s) for s in t.split(':')]
        return 60*t[0] + t[1] 


col_converted = [to_minutes(t) for t in col]

print(col_converted)

Output:

[317, 330, nan, 750.1666666666666]

Regarding your comment that you just want to convert to an integer for scikit.learn, you MUST convert to a common set of units. You cannot just slam together converted integers. If you do, you have changed a continuous variable to an ordinal variable because you have destroyed the linearity. Unless you desire ordinal variable don't do this. For example, if we look at 10-minute increments across an hour boundary, you get something like this:

from matplotlib import pyplot as plt

col = pd.Series(['5:00', '05:10', '5:20', '5:30', '5:40', '5:50', '6:00', '6:10', '6:20', 
    '6:30', '6:40', '6:50'])

def to_minutes(t):
    if pd.isna(t):
        return t
    if t.count(':')==2:
        t = [int(s) for s in t.split(':')]
        return 60*t[0] + t[1] + 1.0/60*t[2]
    if t.count(':')==1:
        t = [int(s) for s in t.split(':')]
        return 60*t[0] + t[1] 

def to_int(t):
    return int(t.replace(':',''))



true_mins = [to_minutes(t) for t in col]

skewed_time = [to_int(t) for t in col]
print(true_mins)
print(skewed_time)


plt.plot(true_mins, skewed_time)
plt.xlabel('actual minutes')
plt.ylabel('bogus time')
plt.show()

bogus time vs. actual time

Upvotes: 0

anky
anky

Reputation: 75080

IIUC, you could use;

final = df.assign(**pd.to_datetime(df[['pta','ptd']].stack()).dt.time.astype(str)
      .str.replace(':','').astype(int).unstack())

        pta      ptd  tpl_num
4   51700.0  51800.0        0
6   52930.0  53000.0        1
9   54200.0  54430.0        2
11  55300.0  55400.0        3
12  60300.0  60530.0        4
17  62430.0      NaN        5

Upvotes: 1

AirSquid
AirSquid

Reputation: 11883

I'm assuming you want to eventually get to datetime objects.

You could make a helper function to convert the raw strings like:

import pandas as pd
import numpy as np
from datetime import datetime

col = pd.Series(['5:17', '05:30', np.NaN, '12:30:10'])

def date_converter(t):
    if pd.isna(t):
        return t
    if t.count(':')==2:
        return datetime.strptime(t, '%H:%M:%S')
    if t.count(':')==1:
        return datetime.strptime(t, '%H:%M')
    else:
        return None

col_converted = [date_converter(t) for t in col]

print(col_converted)

Output:

[datetime.datetime(1900, 1, 1, 5, 17), datetime.datetime(1900, 1, 1, 5, 30), nan, datetime.datetime(1900, 1, 1, 12, 30, 10)]

Upvotes: 0

Related Questions