trench
trench

Reputation: 5345

Defining a function (pandas)

This works already, but I want to optimize a bit:

df['Total Time'] = df['Total Time'].str.split(':').apply(lambda x: (int(x[0])*60.0) + int(x[1]) + (int(x[2]) / 60.0))

I am taking a timestamp (string) in Excel which represents Hours:Minutes:Seconds and turning it into a float which represents minutes. This is easier for me to play around with compared to the string.

I have to do that for 15 columns though. I have read that lambda is a 'throwaway' function. Is it better to define a function (to split and then to turn the values into minutes) and apply it to each column instead? Would I loop through the columns (not all columns, but 15 of about 30)?

On a separate file, I have another technique which is performing the exact same task (turning a timestamp into minutes):

df2['Total Time']= pd.to_timedelta(df2['Total Time'])
df2['Total Time'] = df2['Total Time'] / pd.offsets.Minute(1)

Is one method inherently better or faster than the other?

Edit: I was getting an error when I tried the timedelta method.It looks like some timestamp strings have incorrect values which are not being interpreted correctly.

I am hoping someone will provide a function I can use to apply/map everything to several columns at once efficiently. I am not too familiar with def and return functions (I learned basic pandas before I learned basic python, just for practical purposes.. slowly learning though). Need to turn those timestamps into minutes

ValueError: cannot create timedelta string converter for [09:38:010]

Upvotes: 0

Views: 631

Answers (1)

Jeff
Jeff

Reputation: 128918

At the moment, the idomatic (but more general soln) is actually slower, see the issue here

In [28]: pd.set_option('max_rows',12)

In [29]: s = Series(pd.timedelta_range('0',freq='s',periods=10000).format())

In [30]: s
Out[30]: 
0       00:00:00
1       00:00:01
2       00:00:02
3       00:00:03
4       00:00:04
5       00:00:05
          ...   
9994    02:46:34
9995    02:46:35
9996    02:46:36
9997    02:46:37
9998    02:46:38
9999    02:46:39
dtype: object

Works well on a given regular format

In [31]: %timeit s.str.split(':').apply(lambda x: (int(x[0])*60.0) + int(x[1]) + (int(x[2]) / 60.0))
10 loops, best of 3: 25.2 ms per loop

This will parse almost anything (you can set coerce=True to force invalid values to NaT), but is doing regex-matching, so somewhat slower.

In [32]: %timeit pd.to_timedelta(s) / pd.offsets.Minute()
10 loops, best of 3: 82.8 ms per loop

So looking for a volunteer to write the timedelta parser in c/cython to greatly improve this perf.

Upvotes: 1

Related Questions