Reputation: 832
I have a DataFrame, read in from a CSV, as such (times as HH:MM:SS):
pta ptd tpl_num
4 05:17 05:18 0
6 05:29:30 05:30 1
9 05:42 05:44:30 2
11 05:53 05:54 3
12 06:03 06:05:30 4
17 06:24:30 NaN 5
dtypes
:
pta object
ptd object
tpl_num int64
I'm trying to get the pta
and ptd
columns as an int
, in the format:
pta ptd tpl_num
4 51700 51800 0
6 52930 53000 1
9 54200 54430 2
11 55300 55400 3
12 60300 60530 4
17 62430 NaN 5
Seems quite simple - pad trailing zeroes, and convert to an int. But I can't find a way to do it. I've converted to string, padded the zeroes, stripped the colons, but that won't allow me to convert to int due to the NaN
cells not being recognised. I can't convert to datetime
without having the :SS
in all cells. Not sure what to do.
Upvotes: 0
Views: 244
Reputation: 11883
Here is an adjustment that converts to decimalized minutes, which should be appropriate for training:
import pandas as pd
import numpy as np
from datetime import datetime
col = pd.Series(['5:17', '05:30', np.NaN, '12:30:10'])
def to_minutes(t):
if pd.isna(t):
return t
if t.count(':')==2:
t = [int(s) for s in t.split(':')]
return 60*t[0] + t[1] + 1.0/60*t[2]
if t.count(':')==1:
t = [int(s) for s in t.split(':')]
return 60*t[0] + t[1]
col_converted = [to_minutes(t) for t in col]
print(col_converted)
Output:
[317, 330, nan, 750.1666666666666]
Regarding your comment that you just want to convert to an integer for scikit.learn, you MUST convert to a common set of units. You cannot just slam together converted integers. If you do, you have changed a continuous variable to an ordinal variable because you have destroyed the linearity. Unless you desire ordinal variable don't do this. For example, if we look at 10-minute increments across an hour boundary, you get something like this:
from matplotlib import pyplot as plt
col = pd.Series(['5:00', '05:10', '5:20', '5:30', '5:40', '5:50', '6:00', '6:10', '6:20',
'6:30', '6:40', '6:50'])
def to_minutes(t):
if pd.isna(t):
return t
if t.count(':')==2:
t = [int(s) for s in t.split(':')]
return 60*t[0] + t[1] + 1.0/60*t[2]
if t.count(':')==1:
t = [int(s) for s in t.split(':')]
return 60*t[0] + t[1]
def to_int(t):
return int(t.replace(':',''))
true_mins = [to_minutes(t) for t in col]
skewed_time = [to_int(t) for t in col]
print(true_mins)
print(skewed_time)
plt.plot(true_mins, skewed_time)
plt.xlabel('actual minutes')
plt.ylabel('bogus time')
plt.show()
Upvotes: 0
Reputation: 75080
IIUC, you could use;
final = df.assign(**pd.to_datetime(df[['pta','ptd']].stack()).dt.time.astype(str)
.str.replace(':','').astype(int).unstack())
pta ptd tpl_num
4 51700.0 51800.0 0
6 52930.0 53000.0 1
9 54200.0 54430.0 2
11 55300.0 55400.0 3
12 60300.0 60530.0 4
17 62430.0 NaN 5
Upvotes: 1
Reputation: 11883
I'm assuming you want to eventually get to datetime
objects.
You could make a helper function to convert the raw strings like:
import pandas as pd
import numpy as np
from datetime import datetime
col = pd.Series(['5:17', '05:30', np.NaN, '12:30:10'])
def date_converter(t):
if pd.isna(t):
return t
if t.count(':')==2:
return datetime.strptime(t, '%H:%M:%S')
if t.count(':')==1:
return datetime.strptime(t, '%H:%M')
else:
return None
col_converted = [date_converter(t) for t in col]
print(col_converted)
Output:
[datetime.datetime(1900, 1, 1, 5, 17), datetime.datetime(1900, 1, 1, 5, 30), nan, datetime.datetime(1900, 1, 1, 12, 30, 10)]
Upvotes: 0