Reputation: 11691
I'm trying to apply a function to all rows of a pandas DataFrame (actually just one column in that DataFrame). I'm sure this is a syntax error, but I'm not sure what I'm doing wrong:
df['col'].apply(lambda x, y:(x - y).total_seconds(), args=[d1], axis=1)
The col column contains a bunch of datetime.datetime objects, and d1 is the earliest of them. I'm trying to get a column with the total number of seconds for each row.
I keep getting the following error
TypeError: <lambda>() got an unexpected keyword argument 'axis'
I don't understand why axis is getting passed to my lambda function.
I've also tried doing
def diff_dates(d1, d2):
return (d1-d2).total_seconds()
df['col'].apply(diff_dates, args=[d1], axis=1)
And I get the same error.
Upvotes: 93
Views: 91467
Reputation: 394041
Note there is no axis param for a Series.apply call, as distinct from a DataFrame.apply call.
Series.apply(func, convert_dtype=True, args=(), **kwds)
...
func : function
convert_dtype : boolean, default True
    Try to find better dtype for elementwise function results. If False, leave as dtype=object
args : tuple
    Positional arguments to pass to function in addition to the value
**kwds
    Additional keyword arguments passed to func.
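To illustrate how the args tuple from the signature above supplies extra positional arguments to the function, here is a minimal sketch on made-up data (the Series s and the value d1 are invented for the example):

```python
import pandas as pd
from datetime import datetime

# Toy Series of datetimes; d1 is the earliest value.
s = pd.Series([datetime(2020, 1, 1), datetime(2020, 1, 2)])
d1 = s.min()

# args supplies the second positional argument (y) to the lambda;
# note there is no axis parameter on Series.apply.
secs = s.apply(lambda x, y: (x - y).total_seconds(), args=(d1,))
print(secs.tolist())  # [0.0, 86400.0]
```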
There is one for a DataFrame, but it's unclear how you expect this to work when you're calling apply on a Series yet expecting it to operate on a row.
Upvotes: 136
Reputation: 23131
A single column is (usually) a pandas Series, and as EdChum mentioned, DataFrame.apply has an axis argument but Series.apply doesn't, so apply with axis=1 won't work on columns.
The following works:
df['col'].apply(lambda x, y: (x - y).total_seconds(), args=(d1,))
For applying a function to each element in a column, map can also be used:
df['col'].map(lambda x: (x - d1).total_seconds())
As apply is just syntactic sugar for a Python loop, a list comprehension may be more efficient than both of them because it doesn't have the pandas overhead:
[(x - d1).total_seconds() for x in df['col'].tolist()]
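All three approaches produce the same values; a quick sketch on toy data (the dates below are invented for the example):

```python
import pandas as pd

# Toy frame of datetimes; d1 is the earliest value.
df = pd.DataFrame({'col': pd.to_datetime(['2021-01-01', '2021-01-03'])})
d1 = df['col'].min()

a = df['col'].apply(lambda x, y: (x - y).total_seconds(), args=(d1,))
b = df['col'].map(lambda x: (x - d1).total_seconds())
c = [(x - d1).total_seconds() for x in df['col'].tolist()]

# All three agree elementwise: [0.0, 172800.0]
assert a.tolist() == b.tolist() == c
```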
For a single-column DataFrame, axis=1 may be passed:
df[['col']].apply(lambda x, y: (x - y).dt.total_seconds(), args=[d1], axis=1)
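A sketch of that row-wise call on made-up data: with axis=1 each x is itself a one-element Series, which is why the .dt accessor applies inside the lambda.

```python
import pandas as pd

# Toy single-column DataFrame of datetimes; d1 is the earliest value.
df = pd.DataFrame({'col': pd.to_datetime(['2021-01-01', '2021-01-02'])})
d1 = df['col'].min()

# axis=1 passes each row (a one-element Series) as x; subtracting the
# scalar d1 yields a timedelta Series, hence .dt.total_seconds().
out = df[['col']].apply(lambda x, y: (x - y).dt.total_seconds(), args=[d1], axis=1)
print(out['col'].tolist())  # [0.0, 86400.0]
```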
Avoid apply if you can
apply is not even needed most of the time. For the case in the OP (and most other cases), a vectorized operation exists (just subtract d1 from the column - the value is broadcast to match the column) and is much faster than apply anyway:
(df['col'] - d1).dt.total_seconds()
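A minimal sketch of the broadcast subtraction on invented data:

```python
import pandas as pd

# Toy frame of datetimes; d1 is the earliest value.
df = pd.DataFrame({'col': pd.to_datetime(['2021-01-01', '2021-01-02'])})
d1 = df['col'].min()

# d1 (a scalar Timestamp) is broadcast across the column; the result
# is a timedelta64 Series, converted to seconds with .dt.total_seconds().
df['secs'] = (df['col'] - d1).dt.total_seconds()
print(df['secs'].tolist())  # [0.0, 86400.0]
```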
The vectorized subtraction is about 150 times faster than apply on a column and over 7000 times faster than apply on a single-column DataFrame for a frame with 10k rows. As apply is a loop, this gap gets bigger as the number of rows increases.
df = pd.DataFrame({'col': pd.date_range('2000', '2023', 10_000)})
d1 = df['col'].min()
%timeit df['col'].apply(lambda x, y: (x - y).total_seconds(), args=[d1])
# 124 ms ± 7.57 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['col'].map(lambda x: (x - d1).total_seconds())
# 127 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit [(x - d1).total_seconds() for x in df['col'].tolist()]
# 107 ms ± 4.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit (df['col'] - d1).dt.total_seconds()
# 851 µs ± 189 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%timeit df[['col']].apply(lambda x, y: (x - y).dt.total_seconds(), args=[d1], axis=1)
# 6.07 s ± 419 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1