Reputation: 5345
This works already, but I want to optimize a bit:
df['Total Time'] = df['Total Time'].str.split(':').apply(lambda x: (int(x[0])*60.0) + int(x[1]) + (int(x[2]) / 60.0))
I am taking a timestamp (string) in Excel which represents Hours:Minutes:Seconds and turning it into a float which represents minutes. This is easier for me to play around with compared to the string.
I have to do that for 15 columns though. I have read that lambda is a 'throwaway' function. Is it better to define a function (to split and then to turn the values into minutes) and apply it to each column instead? Would I loop through the columns (not all columns, but 15 of about 30)?
On a separate file, I have another technique which is performing the exact same task (turning a timestamp into minutes):
df2['Total Time']= pd.to_timedelta(df2['Total Time'])
df2['Total Time'] = df2['Total Time'] / pd.offsets.Minute(1)
Is one method inherently better or faster than the other?
Edit: I was getting an error when I tried the timedelta method.It looks like some timestamp strings have incorrect values which are not being interpreted correctly.
I am hoping someone will provide a function I can use to apply/map everything to several columns at once efficiently. I am not too familiar with def and return functions (I learned basic pandas before I learned basic python, just for practical purposes.. slowly learning though). Need to turn those timestamps into minutes
ValueError: cannot create timedelta string converter for [09:38:010]
Upvotes: 0
Views: 631
Reputation: 128918
At the moment, the idomatic (but more general soln) is actually slower, see the issue here
In [28]: pd.set_option('max_rows',12)
In [29]: s = Series(pd.timedelta_range('0',freq='s',periods=10000).format())
In [30]: s
Out[30]:
0 00:00:00
1 00:00:01
2 00:00:02
3 00:00:03
4 00:00:04
5 00:00:05
...
9994 02:46:34
9995 02:46:35
9996 02:46:36
9997 02:46:37
9998 02:46:38
9999 02:46:39
dtype: object
Works well on a given regular format
In [31]: %timeit s.str.split(':').apply(lambda x: (int(x[0])*60.0) + int(x[1]) + (int(x[2]) / 60.0))
10 loops, best of 3: 25.2 ms per loop
This will parse almost anything (you can set coerce=True
to force invalid values to NaT
),
but is doing regex-matching, so somewhat slower.
In [32]: %timeit pd.to_timedelta(s) / pd.offsets.Minute()
10 loops, best of 3: 82.8 ms per loop
So looking for a volunteer to write the timedelta parser in c/cython to greatly improve this perf.
Upvotes: 1