Reputation: 3608
start = pd.to_datetime("2017-02-21 22:32:41",infer_datetime_format=True)
end = pd.to_datetime("2017-02-22 01:32:41",infer_datetime_format=True)
rng = pd.date_range(start.floor('h'), end.floor('h'), freq='h')
left = pd.Series(rng, index=rng ).clip_lower(start)
right = pd.Series(rng + 1, index=rng).clip_upper(end)
s = right - left
I get result as
2017-02-21 22:00:00 00:27:19
2017-02-21 23:00:00 01:00:00
2017-02-22 00:00:00 01:00:00
2017-02-22 01:00:00 00:32:41
I want to convert the result pandas.Series
to dataframe to make my result shown below
id |hour|day|minute|
+-----+----+---+------+
|10001| 22|Wed| 27|
|10001| 23|Thu| 60|
|10001| 00|Thu| 60|
|10001| 01|Thu| 32|
Any direct conversion option or should I have to loop through it ?
Upvotes: 2
Views: 103
Reputation: 294198
Option 1
Using np.core.defchararray.split
after using strftime
Followed up with assign
after using floor division on number of seconds
pd.DataFrame(
np.core.defchararray.split(s.index.strftime('%H %a')).tolist(),
columns=['hour', 'day']
).assign(minute=(s.dt.seconds // 60).values)
hour day minute
0 22 Tue 27
1 23 Tue 60
2 00 Wed 60
3 01 Wed 32
Option 2
Using a dictionaries in a list comprehension.
Note that I use Python 3.6 f-strings.
Otherwise use '{:02d}'.format(i.hour)
pd.DataFrame([dict(
hour=f'{i.hour:02d}',
day=i.strftime('%a'),
minute=v.seconds // 60
) for i, v in s.items()])
day hour minute
0 Tue 22 27
1 Tue 23 60
2 Wed 00 60
3 Wed 01 32
Option 3
And since the topic of speed came up, I wanted to offer another option that considers that.
a = np.array('Mon Tue Wed Thu Fri Sat Sun'.split())
pd.DataFrame(dict(
hour=s.index.hour.astype(str).str.zfill(2),
day=a[s.index.weekday],
minute=s.values.astype('timedelta64[m]').astype(int)
))
day hour minute
0 Tue 22 27
1 Tue 23 60
2 Wed 00 60
3 Wed 01 32
Note: I altered functions to ensure output was identical. Namely focusing on getting column order correct and Hour
column as a string.
def jez(s):
a = s.index.strftime('%H')
b = s.index.strftime('%a')
c = s.dt.floor('T').dt.total_seconds().div(60).astype(int)
return pd.DataFrame({'hour':a,'day':b,'minute':c.values},
columns=['hour','day','minute'])
def pir1(s):
return pd.DataFrame(
np.core.defchararray.split(s.index.strftime('%H %a')).tolist(),
columns=['hour', 'day']
).assign(minute=(s.dt.seconds // 60).values)
def pir2(s):
return pd.DataFrame([dict(
hour=f'{i.hour:02d}',
day=i.strftime('%a'),
minute=v.seconds // 60
) for i, v in s.items()], columns=['hour', 'day', 'minute'])
def pir3(s):
a = np.array('Mon Tue Wed Thu Fri Sat Sun'.split())
return pd.DataFrame(dict(
hour=s.index.hour.astype(str).str.zfill(2),
day=a[s.index.weekday],
minute=s.values.astype('timedelta64[m]').astype(int)
), columns=['hour', 'day', 'minute'])
res = pd.DataFrame(
np.nan,
[10, 30, 100, 300, 1000, 3000, 10000, 30000],
'jez pir1 pir2 pir3'.split()
)
for i in res.index:
start = pd.to_datetime("2007-02-21 22:32:41", infer_datetime_format=True)
rng = pd.date_range(start.floor('h'), periods=i, freq='h')
end = rng.max() + pd.to_timedelta("01:32:41")
left = pd.Series(rng, index=rng).clip_lower(start)
right = pd.Series(rng + 1, index=rng).clip_upper(end)
s = right - left
for j in res.columns:
stmt = f'{j}(s)'
setp = f'from __main__ import {j}, s'
res.at[i, j] = timeit(stmt, setp, number=100)
res.plot(loglog=True)
res.div(res.min(1), 0)
jez pir1 pir2 pir3
10 2.364757 1.922064 1.000000 1.124539
30 1.916160 2.092680 1.129115 1.000000
100 3.039881 3.361606 2.180457 1.000000
300 3.967504 5.025567 3.920143 1.000000
1000 7.106132 9.757840 7.607425 1.000000
3000 10.104004 14.741414 11.957978 1.000000
10000 10.522324 15.318158 13.262373 1.000000
30000 11.804760 16.718153 14.289628 1.000000
In the chart, you can see that jez
, pir1
and pir2
are all grouped together when plotted in log space. This tells us that their time is growing at the same order of magnitude. However, pir3
has a large separation and gets bigger over larger data. The time complexity of pir3
is smaller and indicates a much larger advantage.
This becomes more clear as we look at the table of multiples. Each row has a lowest value of 1
which indicates the fastest time. All other values in that row are multiples of time it took to accomplish the same task. In other words. The larger the value, the slower the method. As you can see, those multiples get larger over larger data. That means the advantage of pir3
gets better and better.
That is what better looks like. It is pointless to boast of 25% time improvements. Unless you have order of magnitude improvements, it isn't worth trying to convince the readers that an algorithm or approach is "better".
Upvotes: 5
Reputation: 862396
I think you need DatetimeIndex.strftime
for hours and days of week and for minutes from timedeltas use Series.dt.floor
+ Series.dt.total_seconds
:
a = s.index.strftime('%H')
b = s.index.strftime('%a')
c = s.dt.floor('T').dt.total_seconds().div(60).astype(int)
#alternative
#c = s.dt.total_seconds().floordiv(60).astype(int)
df = pd.DataFrame({'hour':a,'day':b,'minute':c.values},
columns=['hour','day','minute'])
print (df)
hour day minute
0 22 Tue 27
1 23 Tue 60
2 00 Wed 60
3 01 Wed 32
Upvotes: 1