TimK

Reputation: 615

How to prevent .apply from changing the dtype of a boolean pandas Series

Is it possible to keep the dtype of the object that apply is called on? As I understand it, apply changes the dtype.

Please see the following MWE. This result is not what I want to achieve.

import pandas as pd
ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: ~x)
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)

results in:

False
int64

ds_b should have the same dtype (bool) as ds_a. I am interested in how to prevent any change of dtype.

EDIT: Here is a better MWE for my use case:

import numpy as np
import pandas as pd
ds_a = pd.Series([True,False,True,True,True,False])
ds_mask = pd.Series([True,False])
func = lambda x: np.all(x==ds_mask)
ds_b = ds_a.rolling(len(ds_mask)).apply(func, raw=True)
print(func(ds_a[:2]).dtype)
print(ds_b.dtype)

results in:

bool
float64

Upvotes: 2

Views: 577

Answers (2)

James Mchugh

Reputation: 1014

The issue is not necessarily that the Series is casting the values. The issue is that the bitwise complement operator ~ is used instead of the logical not operator. This causes the booleans True and False to be treated as integers, resulting in the following:

~True = -2
~False = -1

This is why the output Series ds_b has a dtype of int64. Changing the code to the following should resolve the issue.

import pandas as pd


ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: not x)
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)
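
The difference between the two operators can be checked directly. The following is a minimal sketch, not part of the original answer. It also shows the vectorized ~ds_a, offered here as an alternative to apply on the assumption that the first MWE only needs an element-wise negation.

```python
import pandas as pd

ds_a = pd.Series([True, False, True])

# On plain Python bools, ~ is integer bitwise inversion
# (Python 3.12+ also emits a DeprecationWarning for it).
print(~True, ~False)          # -2 -1

# Logical negation inside apply keeps the bool dtype.
ds_not = ds_a.apply(lambda x: not x)
print(ds_not.dtype)           # bool

# On the Series itself, ~ is element-wise logical negation.
ds_inv = ~ds_a
print(ds_inv.dtype)           # bool
print(ds_inv.tolist())        # [False, True, False]
```

Since ~ds_a never leaves pandas, it avoids both the per-element Python call and the dtype inference step.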

However, you are correct that the apply method adjusts the dtype of the resulting Series based on the values the function returns. In your case, it converted the Python int results to int64. If you come across this behavior in the future and it is undesired, consider the following code.

ds_b = ds_a.apply(lambda x: ~x, convert_dtype=False).astype(ds_a.dtype)

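This prevents apply from doing automatic conversions, and at the end it converts the dtype from object back to the original type. Note that this only restores the dtype: with ~x the intermediate values are -2 and -1, which both cast to True, so use not x in the lambda if the values matter. Here are some timings to compare. On this three-element Series, the final cast adds roughly 100 to 140 µs per call.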

In [26]: %timeit ds_b = ds_a.apply(lambda x: ~x)                                
257 µs ± 5.67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [27]: %timeit ds_b = ds_a.apply(lambda x: ~x).astype(ds_a.dtype)             
394 µs ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [28]: %timeit ds_b = ds_a.apply(lambda x: ~x, convert_dtype=False).astype(ds_a.dtype)
359 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
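
The cast-back pattern can be wrapped in a small helper. This is a minimal sketch: apply_keep_dtype is a hypothetical name, not a pandas function, and it leaves out convert_dtype, which recent pandas releases (2.1 and later, as far as I know) mark as deprecated.

```python
import pandas as pd

def apply_keep_dtype(series, func):
    """Apply func element-wise, then cast the result back to the input dtype.

    Hypothetical helper for illustration; not part of pandas.
    """
    return series.apply(func).astype(series.dtype)

ds_a = pd.Series([True, False, True])
ds_b = apply_keep_dtype(ds_a, lambda x: not x)
print(ds_b.dtype == ds_a.dtype)   # True
print(ds_b.tolist())              # [False, True, False]
```

The cast only guarantees the dtype, so the function passed in still has to return values that survive the conversion unchanged.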

In your latest example, the Rolling object automatically handles the data as float64. This is a limitation of rolling rather than of Series or DataFrame apply. As it stands, there is no way to change the dtype for rolling operations within pandas other than casting the result at the end. To do that, cast the dtype at the end as in the code above, but omit the convert_dtype parameter, since it does not apply to the Rolling object's apply method.
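
For the rolling case, casting at the end could look like the sketch below. It assumes that dropping the incomplete first window is acceptable. matches_mask is a hypothetical helper name, and the mask is a NumPy array here instead of the Series from the question.

```python
import numpy as np
import pandas as pd

ds_a = pd.Series([True, False, True, True, True, False])
mask = np.array([True, False])

def matches_mask(window):
    # With raw=True the window arrives as a float64 ndarray (1.0 / 0.0).
    return float(np.array_equal(window, mask))

rolled = ds_a.rolling(len(mask)).apply(matches_mask, raw=True)
print(rolled.dtype)               # float64

# The first window is incomplete and yields NaN, which astype(bool)
# would turn into True, so drop it before casting back.
ds_b = rolled.dropna().astype(ds_a.dtype)
print(ds_b.tolist())              # [True, False, False, False, True]
```

If the result has to keep the original length, the NaN entry needs an explicit fill value before the cast instead of dropna.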

If you are open to using packages other than Pandas, a rolling function can be implemented using numpy. See the following code:

import numpy as np

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

a = np.array([ True, False,  True,  True,  True, False])
mask = np.array([True, False])

b = (rolling_window(a, 2) == mask).all(axis=1, keepdims=True)

After execution, b is equal to the expected output for your second MWE, except that it is a NumPy array.

array([[ True],
       [False],
       [False],
       [False],
       [ True]])
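
On NumPy 1.20 or later, the same windows can be built with sliding_window_view, which avoids computing shapes and strides by hand. This is a sketch of an alternative, not the answer's original code. Wrapping the result in a Series is an assumption about how it would be used.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

a = np.array([True, False, True, True, True, False])
mask = np.array([True, False])

# Each row of the view is one window of len(mask) consecutive elements.
windows = sliding_window_view(a, len(mask))
b = (windows == mask).all(axis=1)
print(b.dtype)        # bool
print(b.tolist())     # [True, False, False, False, True]

# Align with the original index: window i ends at position i + len(mask) - 1.
ds_b = pd.Series(b, index=np.arange(len(mask) - 1, len(a)))
print(ds_b.dtype)     # bool
```

Because the comparison runs on boolean arrays throughout, no float64 intermediate appears and no cast is needed at the end.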

Upvotes: 3

Max Voitko

Reputation: 1629

Just add an explicit conversion to bool in the lambda you are applying:

import pandas as pd


ds_a = pd.Series([True,False,True])
ds_b = ds_a.apply(lambda x: bool(~x))
print(ds_a.dtype == ds_b.dtype)
print(ds_b.dtype)
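
It is worth checking the values as well as the dtype here. In the sketch below (an illustration, not part of the original answer), bool(~x) is True for both inputs because ~True is -2 and ~False is -1, so plain not x is assumed to be what is actually wanted.

```python
import pandas as pd

ds_a = pd.Series([True, False, True])

# bool(~x): dtype is bool, but -2 and -1 are both truthy.
ds_b = ds_a.apply(lambda x: bool(~x))
print(ds_b.dtype, ds_b.tolist())   # bool [True, True, True]

# Logical negation gives the inverted values with the same dtype.
ds_c = ds_a.apply(lambda x: not x)
print(ds_c.dtype, ds_c.tolist())   # bool [False, True, False]
```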

Upvotes: 1
