Reputation: 615
Is it possible to keep the dtype of the object that the apply function is called on? As I understand it, apply changes the dtype.
Please see the following MWE. The result is not what I want to achieve.
import pandas as pd
ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: ~x)
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)
results in:
False
int64
ds_b should be the same dtype (boolean) as ds_a. I am interested in how to prevent any data type change.
EDIT: Here is a better MWE for my use-case.
Please see the following (new) MWE.
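import numpy as np
import pandas as pd
ds_a = pd.Series([True,False,True,True,True,False])
ds_mask = pd.Series([True,False])
# use numpy directly; the pd.np alias is not available in newer pandas
func = lambda x: np.all(x==ds_mask)
ds_b = ds_a.rolling(len(ds_mask)).apply(func, raw=True)
# call func on the first window (the original called an undefined name a)
print(func(ds_a[:2]).dtype)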
print(ds_b.dtype)
results in:
bool
float64
Upvotes: 2
Views: 577
Reputation: 1014
The issue is not necessarily that the Series is casting the values. The issue is that the bitwise complement operator ~ is being used instead of the logical not operator. This causes the booleans True and False to be treated as integers, resulting in the following:
~True = -2
~False = -1
This is why the output Series ds_b has a dtype of int64. Changing the code to the following should resolve that issue.
import pandas as pd
ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: not x)
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)
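As a small sketch of the distinction described above, the snippet below compares element-wise not inside apply with applying ~ to the whole Series. The vectorized form is my own addition rather than part of the question, and the names ds_not and ds_inv are only illustrative.

```python
import pandas as pd

ds_a = pd.Series([True, False, True])

# Inside apply, each element is a plain Python bool, so `not` returns a bool.
ds_not = ds_a.apply(lambda x: not x)

# On a boolean Series, ~ is applied element-wise and acts as logical NOT.
ds_inv = ~ds_a

print(ds_not.dtype, ds_inv.dtype)   # bool bool
print(ds_not.equals(ds_inv))        # True
```

If the function really is just a negation, the vectorized form avoids apply entirely and keeps the boolean dtype.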
However, you are correct that the apply method adjusts the dtype of the resulting Series based on the values the function returns. For example, in your case, it converted the Python int results to int64. If you come across this behavior in the future and it is undesired, consider the following code.
ds_b = ds_a.apply(lambda x: ~x, convert_dtype=False).astype(ds_a.dtype)
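This prevents apply from doing automatic conversions, and at the end it converts the dtype from object back to the original type. Keep in mind that the cast only restores the dtype: with ~x, both -2 and -1 become True when cast to bool, so use not x when logical negation is intended. Here are some timings to compare; the cast does not introduce a significant amount of overhead.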
In [26]: %timeit ds_b = ds_a.apply(lambda x: ~x)
257 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [27]: %timeit ds_b = ds_a.apply(lambda x: ~x).astype(ds_a.dtype)
394 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [28]: %timeit ds_b = ds_a.apply(lambda x: ~x, convert_dtype=False).astype(ds_a.dtype)
359 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
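The cast-back pattern can be wrapped in a small helper. This is only a sketch: apply_keep_dtype is a hypothetical name, not a pandas function, and it omits convert_dtype, which, as far as I know, newer pandas releases have deprecated. The second call uses ~int(x) to reproduce the -2 / -1 values that ~x gives for a Python bool.

```python
import pandas as pd

def apply_keep_dtype(series, func):
    """Apply func element-wise, then cast the result back to the input dtype."""
    return series.apply(func).astype(series.dtype)

ds_a = pd.Series([True, False, True])

# Logical negation: the values and the dtype both come out as intended.
ds_b = apply_keep_dtype(ds_a, lambda x: not x)
print(ds_b.dtype)      # bool
print(ds_b.tolist())   # [False, True, False]

# Caveat: the cast restores the dtype, not the meaning of the values.
# ~int(x) yields -2 or -1, and any non-zero integer casts to True.
ds_c = apply_keep_dtype(ds_a, lambda x: ~int(x))
print(ds_c.tolist())   # [True, True, True]
```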
In your latest example, the Rolling instance automatically tries to handle the data as float64. This is more a limitation of rolling than of a Series or DataFrame apply. As it stands, there is no way to change the dtype for rolling operations within pandas besides casting the results at the end. To do that, cast the dtype at the end as in the code above, but omit the convert_dtype parameter, since the Rolling object's apply method does not accept it.
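A minimal sketch of casting the rolling result at the end follows. The first window is incomplete and comes back as NaN, and NaN would cast to True, so the sketch drops it before casting. Dropping is my assumption about the desired behavior; filling with 0 before the cast would be another option.

```python
import numpy as np
import pandas as pd

ds_a = pd.Series([True, False, True, True, True, False])
mask = np.array([True, False])

# Each window arrives as a float64 array and the result is float64 as well.
ds_roll = ds_a.rolling(len(mask)).apply(lambda x: np.all(x == mask), raw=True)
print(ds_roll.dtype)   # float64

# Drop the incomplete first window (NaN), then cast back to the input dtype.
ds_b = ds_roll.dropna().astype(ds_a.dtype)
print(ds_b.dtype)      # bool
print(ds_b.tolist())   # [True, False, False, False, True]
```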
If you are open to using packages other than pandas, a rolling function can be implemented using NumPy. See the following code:
import numpy as np
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
a = np.array([ True, False, True, True, True, False])
mask = np.array([True, False])
b = (rolling_window(a, 2) == mask).all(axis=1, keepdims=True)
After execution, b is equal to the expected output for your second MWE, except that it is a NumPy array.
array([[ True],
       [False],
       [False],
       [False],
       [ True]])
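Assuming NumPy 1.20 or newer, the same windows can also be built with sliding_window_view instead of hand-written strides. The sketch below returns a 1-D array and wraps it in a Series; labelling each result with the last position of its window is my assumption, chosen to mirror how rolling labels its output.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([True, False, True, True, True, False])
mask = np.array([True, False])

# One row per window: shape (len(a) - len(mask) + 1, len(mask)).
windows = sliding_window_view(a, len(mask))

# A window matches when every element equals the mask; the result stays boolean.
b = (windows == mask).all(axis=1)

# Label each result with the last position of its window.
ds_b = pd.Series(b, index=range(len(mask) - 1, len(a)))
print(ds_b.dtype)      # bool
print(ds_b.tolist())   # [True, False, False, False, True]
```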
Upvotes: 3
Reputation: 1629
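Just add an explicit conversion to boolean in the lambda you are applying. Note that this only restores the dtype: since ~True is -2 and ~False is -1, bool(~x) is True for both inputs.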
import pandas as pd
ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: bool(~x))
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)
Upvotes: 1